Kinesis Streams和Flink [英] Kinesis Streams and Flink

查看:386
本文介绍了Kinesis Streams和Flink的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个关于在Kinesis流中分片数据的问题.在将用户数据发送到我的运动流时,我想使用随机分区键,以便分片中的数据均匀分布.为了简化此问题,我想通过在Flink应用程序中键入userId来聚合用户数据.

I have a question regarding sharding data in a Kinesis stream. I would like to use a random partition key when sending user data to my kinesis stream so that the data in the shards is evenly distributed. For the sake of making this question simpler, I would then like to aggregate the user data by keying off of a userId in my Flink application.

我的问题是:如果分片是随机分区的,那么一个userId的数据分布在多个Kinesis分片上,Flink可以处理读取的多个分片,然后重新分发数据,以便单个userId的所有数据流到同一聚合器任务?或者,在Flink消费运动流之前,我是否需要按用户ID对其进行分片?

My question is this: if the shards are randomly partitioned so that data for one userId is spread across multiple Kinesis shards, can Flink handle reading off of multiple shards and then redistributing the data so that all of the data for a single userId is streamed to the same aggregator task? Or, do I need to shard the kinesis stream by user id before it is consumed by Flink?

推荐答案

... Flink可以处理多个分片的读取,然后重新分发数据,以便将单个userId的所有数据流式传输到同一聚合器任务吗?

... can Flink handle reading off of multiple shards and then redistributing the data so that all of the data for a single userId is streamed to the same aggregator task?

如果使用Flink的DataStream API,则keyBy(e -> e.userId)的作用是重新分发所有事件,以便将任何特定userId的所有事件都流式传输到同一下游聚合器任务.

The effect of a keyBy(e -> e.userId), if you use Flink's DataStream API, is to redistribute all of the events so that all events for any particular userId will be streamed to the same downstream aggregator task.

每个主机都会从流中的一部分分片中读取数据吗,然后Flink会使用keyBy运算符将具有相同密钥的消息传递给将执行实际聚合的主机吗?

Would each host read in data from a subset of the shards in the stream and would Flink then use the keyBy operator to pass messages of the same key to the host that will perform the actual aggregation?

是的,没错.

例如,如果您有8个物理主机,每个物理主机提供8个用于运行作业的插槽,则将有64个聚合任务的实例,每个实例将负责键空间的不相交的子集.

If, for example, you have 8 physical hosts, each providing 8 slots for running the job, then there will be 64 instances of the aggregator task, each of which will be responsible for a disjoint subset of the key space.

假设有64个以上的分片可供读取,那么在64个任务中的每个分片中,源将从一个或多个分片中读取数据,然后根据其userId分配读取的事件.假设userIds均匀地分布在各个分片上,则每个源实例都会发现它读取的一些事件是针对分配给它处理的userIds的,应该使用本地聚合器.其余事件每个都需要发送到其他63个聚合器之一,具体取决于哪个工作人员负责每个userId.

Assuming there are more than 64 shards available to read from, then each in each of the 64 tasks, the source will read from one or more shards, and then distribute the events it reads according to their userIds. Assuming the userIds are evenly spread across the shards, then each source instance will find that a few of the events it reads are for userIds it is been assigned to handle, and the local aggregator should be used. The rest of the events will each need to be sent out to one of the other 63 aggregators, depending on which worker is responsible for each userId.

这篇关于Kinesis Streams和Flink的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆