Kinesis Streams and Flink


Question

I have a question regarding sharding data in a Kinesis stream. I would like to use a random partition key when sending user data to my kinesis stream so that the data in the shards is evenly distributed. For the sake of making this question simpler, I would then like to aggregate the user data by keying off of a userId in my Flink application.

My question is this: if the shards are randomly partitioned so that data for one userId is spread across multiple Kinesis shards, can Flink handle reading off of multiple shards and then redistributing the data so that all of the data for a single userId is streamed to the same aggregator task? Or, do I need to shard the kinesis stream by user id before it is consumed by Flink?
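The spreading effect of a random partition key can be sketched with a quick stdlib-only simulation. Kinesis routes each record by taking an MD5 hash of its partition key; the `shardFor` helper below is a hypothetical simplification that buckets that hash with a modulo (real shards own contiguous hash-key ranges, but for evenly split shards the records-per-shard distribution comes out the same):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class ShardSpread {
    // Simplified model of Kinesis routing: MD5 the partition key, then
    // bucket the 128-bit hash across numShards evenly split shards.
    static int shardFor(String partitionKey, int numShards) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, md5).mod(BigInteger.valueOf(numShards)).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always present in the JDK
        }
    }

    // Using the userId itself as the partition key: every record for that
    // user hashes identically, so all records land on a single shard.
    static Set<Integer> shardsForFixedKey(String userId, int records, int numShards) {
        Set<Integer> shards = new HashSet<>();
        for (int i = 0; i < records; i++) {
            shards.add(shardFor(userId, numShards));
        }
        return shards;
    }

    // Using a fresh random key per record: the same user's records spread
    // across (almost certainly) many shards.
    static Set<Integer> shardsForRandomKey(int records, int numShards) {
        Set<Integer> shards = new HashSet<>();
        for (int i = 0; i < records; i++) {
            shards.add(shardFor(UUID.randomUUID().toString(), numShards));
        }
        return shards;
    }
}
```

So a random key gives even shard utilization at the cost of scattering each user's data — which is exactly what motivates the question of whether Flink can regroup it afterwards.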

Answer

... can Flink handle reading off of multiple shards and then redistributing the data so that all of the data for a single userId is streamed to the same aggregator task?

The effect of a keyBy(e -> e.userId), if you use Flink's DataStream API, is to redistribute all of the events so that all events for any particular userId will be streamed to the same downstream aggregator task.
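The routing invariant behind `keyBy` can be illustrated without a cluster. Flink really assigns each key to a key group (a hash modulo the max parallelism) and gives each subtask a contiguous range of key groups; the sketch below is a simplified stand-in that hashes a key straight to a subtask index, but it preserves the property that matters — every event for a given userId reaches the same downstream task, regardless of which source partition it arrived on:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyBySketch {
    // Simplified stand-in for Flink's key -> subtask routing. Flink uses
    // murmur-hashed key groups internally, but the invariant is identical:
    // the same key always maps to the same subtask.
    static int subtaskFor(String userId, int parallelism) {
        return Math.floorMod(userId.hashCode(), parallelism);
    }

    // Shuffle events (each a {userId, payload} pair) from any mix of
    // source partitions into per-subtask lists, as a network shuffle would.
    static Map<Integer, List<String>> shuffleByKey(List<String[]> events, int parallelism) {
        Map<Integer, List<String>> bySubtask = new HashMap<>();
        for (String[] e : events) {
            int task = subtaskFor(e[0], parallelism);
            bySubtask.computeIfAbsent(task, k -> new ArrayList<>()).add(e[0] + ":" + e[1]);
        }
        return bySubtask;
    }
}
```

In an actual job this shuffle is what `keyBy(e -> e.userId)` inserts between the Kinesis source and the keyed aggregation that follows it.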

Would each host read in data from a subset of the shards in the stream and would Flink then use the keyBy operator to pass messages of the same key to the host that will perform the actual aggregation?

Yes, that's right.

If, for example, you have 8 physical hosts, each providing 8 slots for running the job, then there will be 64 instances of the aggregator task, each of which will be responsible for a disjoint subset of the key space.

Assuming there are more than 64 shards available to read from, then in each of the 64 tasks the source will read from one or more shards and distribute the events it reads according to their userIds. Assuming the userIds are evenly spread across the shards, each source instance will find that only a small fraction of the events it reads are for userIds it has been assigned to handle, and for those the local aggregator can be used. Each of the remaining events will need to be sent to one of the other 63 aggregators, depending on which worker is responsible for that userId.
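That 1-in-64 arithmetic can be checked with a toy model. A hypothetical multiplicative hash stands in for Flink's key-to-subtask assignment (any mixing hash gives the same picture for uniformly distributed userIds); with 64 aggregator subtasks, a source instance co-located with one aggregator keeps only about 1/64 of what it reads:

```java
public class LocalFraction {
    // Hypothetical stand-in for Flink's key -> subtask assignment,
    // using a Knuth-style multiplicative hash for mixing.
    static int subtaskFor(long userId, int parallelism) {
        return (int) Math.floorMod(userId * 2654435761L, (long) parallelism);
    }

    // Fraction of events that a source co-located with aggregator
    // `localTask` keeps on-host; the other (P-1)/P cross the network.
    static double localFraction(int numEvents, int parallelism, int localTask) {
        int local = 0;
        for (long userId = 0; userId < numEvents; userId++) {
            if (subtaskFor(userId, parallelism) == localTask) {
                local++;
            }
        }
        return (double) local / numEvents;
    }
}
```

So each of the 64 sources keeps roughly 1.5% of its input local and ships the remaining ~98.5% to the other 63 aggregators — the unavoidable cost of the `keyBy` shuffle when the stream is not pre-partitioned by userId.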
