How does Kafka Streams work with Partitions that contain incomplete Data?

Question

Kafka Streams engine maps a partition to exactly one worker (i.e. Java App), so that all messages in that partition are processed by that worker. I have the following scenario, and am trying to understand if it is still feasible for it to work.

I have a Topic A (with 3 partitions). The messages sent to it are partitioned randomly by Kafka (i.e. there is no key). The messages I send to it have a schema like the one below

{carModel: "Honda", color: "Red", timeStampEpoch: 14334343342}

Since I have 3 partitions, and the messages are partitioned randomly across them, cars of the same model could be written to different partitions. For example

P1
{carModel: "Honda", color: "Red", timeStampEpoch: 14334343342}
{carModel: "Honda", color: "Red", timeStampEpoch: 14334343342}
{carModel: "Toyota", color: "Blue", timeStampEpoch: 14334343342}

P2
{carModel: "Toyota", color: "Red", timeStampEpoch: 14334343342}
{carModel: "Honda", color: "Red", timeStampEpoch: 14334343342}
{carModel: "Nissan", color: "Blue", timeStampEpoch: 14334343342}

P3
{carModel: "Nissan", color: "Red", timeStampEpoch: 14334343342}
{carModel: "Honda", color: "Red", timeStampEpoch: 14334343342}
{carModel: "Nissan", color: "Blue", timeStampEpoch: 14334343342}
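The partial-count problem described above can be simulated in plain Java (this is only an illustration of the layout in the example, not Kafka code): each worker counts only the records in its own partition, so its counts per carModel are incomplete.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative simulation of the example partitions P1-P3 above.
// A worker that owns one partition sees only that partition's records.
public class PartitionCountDemo {

    static final List<List<String>> PARTITIONS = List.of(
            List.of("Honda", "Honda", "Toyota"),   // P1
            List.of("Toyota", "Honda", "Nissan"),  // P2
            List.of("Nissan", "Honda", "Nissan")); // P3

    // Count carModel occurrences within a single partition.
    static Map<String, Long> countByModel(List<String> partition) {
        Map<String, Long> counts = new HashMap<>();
        for (String model : partition) {
            counts.merge(model, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        for (int p = 0; p < PARTITIONS.size(); p++) {
            // A worker assigned only this partition would report these counts.
            System.out.println("P" + (p + 1) + " -> " + countByModel(PARTITIONS.get(p)));
        }
    }
}
```

Running this shows Honda counted as 2, 1, and 1 by the three workers, while the true total across all partitions is 4.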

Now let's say I wanted to count the total number of cars seen by carModel. I write a Kafka Streams application that listens to topic A, maps messages by carModel, i.e.

carStream.map((key, value) -> KeyValue.pair(value.get("carModel"), value))

and writes the totals to another topic B, as messages of the form

{carModel: "Nissan", totalCount: 5}

I then launch 3 instances of it, all part of the same Consumer Group. Kafka would then effectively map each partition to one of the workers. Example

P1 --> Worker A
P2 --> Worker B
P3 --> Worker C

However, since each Worker only sees 1 partition then it will only see partial information for each car model. It will miss data for the same car model from other partitions.

Question: Is my understanding correct?

If it is, I can imagine that I could re-partition (i.e. reshuffle) my data by carModel for this use case to work.

But I just want to make sure I'm not misunderstanding how this works, and in fact Kafka does somehow magically take care of the re-partitioning after my internal mapping in my application.

Answer

Kafka Streams will do the repartitioning of your data automatically. Your program will be something like:

stream.map(...).groupByKey().count();

For this pattern, Kafka Streams detects that you set a new key in map and thus will create a topic automatically in the background to repartition the data for the groupByKey().count() step (as of v0.10.1 via KAFKA-3561).

Note that map() only "marks" the stream as requiring repartitioning, and it is .groupByKey().count() that creates the repartition topic. In this regard, repartitioning is "lazy", i.e., it is only done if required. If there were no .groupByKey().count(), no repartitioning would be introduced.
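What the auto-created repartition topic achieves can be sketched in plain Java (this is only an illustration, not the actual Kafka internals): after re-keying by carModel, every record with the same key is routed to the same partition, so the worker owning that partition sees the complete count for that model. Kafka's real partitioner uses murmur2; String.hashCode() stands in here for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of key-based routing into a repartition topic.
public class RepartitionDemo {

    static final int NUM_PARTITIONS = 3;

    // Same idea as Kafka's default partitioner for keyed records:
    // partition = hash(key) mod numPartitions.
    static int partitionFor(String key) {
        return Math.abs(key.hashCode() % NUM_PARTITIONS);
    }

    // Route each record by its carModel key and count per target partition.
    static Map<Integer, Map<String, Long>> route(List<String> models) {
        Map<Integer, Map<String, Long>> byPartition = new TreeMap<>();
        for (String model : models) {
            byPartition.computeIfAbsent(partitionFor(model), p -> new HashMap<>())
                       .merge(model, 1L, Long::sum);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        // The nine records from the example, originally spread across P1-P3.
        List<String> models = List.of("Honda", "Honda", "Toyota", "Toyota",
                "Honda", "Nissan", "Nissan", "Honda", "Nissan");
        // After routing by key, each carModel lives in exactly one partition,
        // so the downstream count per model is complete (e.g. Honda=4).
        System.out.println(route(models));
    }
}
```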

Basically, the program from above is executed in the same way as

stream.map(...).through("some-topic").groupByKey().count();

Kafka Streams automatically "inserts" the through() step and thus computes the correct result.

If you are using Kafka Streams 0.10.0, you need to create the repartition topic manually with the desired number of partitions, and you need to add the through() call to your code yourself.
