How does Kafka Streams work with Partitions that contain incomplete Data?

Question

Kafka Streams engine maps a partition to exactly one worker (i.e. Java App), so that all messages in that partition are processed by that worker. I have the following scenario, and am trying to understand if it is still feasible for it to work.

I have a Topic A (with 3 partitions). The messages sent to it are partitioned randomly by Kafka (i.e. there is no key). The message I send to it has a schema like below

{carModel: "Honda", color: "Red", timeStampEpoch: 14334343342}

Since I have 3 partitions, and the messages are partitioned randomly across them, cars of the same model could be written to different partitions. For example

P1
{carModel: "Honda", color: "Red", timeStampEpoch: 14334343342}
{carModel: "Honda", color: "Red", timeStampEpoch: 14334343342}
{carModel: "Toyota", color: "Blue", timeStampEpoch: 14334343342}

P2
{carModel: "Toyota", color: "Red", timeStampEpoch: 14334343342}
{carModel: "Honda", color: "Red", timeStampEpoch: 14334343342}
{carModel: "Nissan", color: "Blue", timeStampEpoch: 14334343342}

P3
{carModel: "Nissan", color: "Red", timeStampEpoch: 14334343342}
{carModel: "Honda", color: "Red", timeStampEpoch: 14334343342}
{carModel: "Nissan", color: "Blue", timeStampEpoch: 14334343342}

Now let's say I wanted to count the total number of cars seen by carModel. I write a Kafka Streams application that listens to topic A, maps messages by carModel, i.e.

carStream.map((key, value) -> KeyValue.pair(value.get("carModel"), value)) // assuming the value is deserialized to a Map-like object

and writes the totals to another topic B, as messages of the form

{carModel: "Nissan", totalCount: 5}

I then launch 3 instances of it, all part of the same Consumer Group. Kafka would then efficiently map each partition to one of the workers. Example

P1 --> Worker A
P2 --> Worker B
P3 --> Worker C

However, since each Worker only sees 1 partition then it will only see partial information for each car model. It will miss data for the same car model from other partitions.

Question: Is my understanding correct?

If it is, I can imagine that I could re-partition (i.e. reshuffle) my data by carModel for this use case to work.

But I just want to make sure I'm not misunderstanding how this works, and in fact Kafka does somehow magically take care of the re-partitioning after my internal mapping in my application.

Answer

Kafka Streams will do the repartitioning of your data automatically. Your program will be something like:

stream.map(...).groupByKey().count();

For this pattern, Kafka Streams detects that you set a new key in map and thus will create a topic automatically in the background to repartition the data for the groupByKey().count() step (as of v0.10.1 via KAFKA-3561).

Note that map() "marks" the stream as requiring repartitioning, and .groupByKey().count() will create the topic used for repartitioning. In this regard, repartitioning is "lazy", i.e., it is only done if required. If there were no .groupByKey().count(), no repartitioning would be introduced.
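
This laziness can be observed from the topology description, as in the following sketch (topic names are hypothetical; nothing needs to run, building the topology is enough):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class LazyRepartitionDemo {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> cars =
                builder.stream("topic-A", Consumed.with(Serdes.String(), Serdes.String()));

        // map() only marks the stream as re-keyed; on its own it creates no repartition topic.
        KStream<String, String> byModel =
                cars.map((key, value) -> KeyValue.pair(value, value)); // re-keying stand-in

        // A key-agnostic sink still triggers no repartitioning:
        byModel.to("some-sink-topic", Produced.with(Serdes.String(), Serdes.String()));

        // Only this key-based operation makes Kafka Streams add an internal
        // "<application.id>-...-repartition" topic to the topology:
        byModel.groupByKey().count();

        // The description shows the extra sink/source pair for the repartition topic.
        System.out.println(builder.build().describe());
    }
}

Comment out the groupByKey().count() line and the repartition sink/source pair disappears from the printed description.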

Basically, the program from above is executed in the same way as

stream.map(...).through("some-topic").groupByKey().count();

Kafka Streams automatically "inserts" the through() step and thus computes the correct result.

If you are using Kafka Streams 0.10.0, you will need to create the repartition topic manually with the desired number of partitions and you will need to add the call to through() to your code, too.
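
Continuing the earlier sketch, the manual 0.10.0-era version would look roughly like this (the repartition topic name is hypothetical, and the grouping/count spelling below uses the later DSL names for readability):

// First create the repartition topic with the desired partition count, e.g.:
//   bin/kafka-topics.sh --create --zookeeper localhost:2181 \
//     --topic car-model-repartition --partitions 3 --replication-factor 1
carStream
    .map((key, value) -> KeyValue.pair(extractCarModel(value), value)) // re-key by carModel
    .through("car-model-repartition") // explicit shuffle: write re-keyed records out, read them back co-partitioned by carModel
    .groupByKey()
    .count();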
