Kafka Streaming Concurrency?

Question

I have some basic Kafka Streaming code that reads records from one topic, does some processing, and outputs records to another topic.

How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.

If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.

If it's multi-threaded, I need to understand how this works and how to handle resources, for example whether SQL database connections should be shared across processing threads.

Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?

Recommended answer

How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.

This is documented in detail at http://docs.confluent.io/current/streams/architecture.html#parallelism-model. I don't want to copy-paste it here verbatim, but IMHO the key concept to understand is partitions: Kafka's topic partitions, which Kafka Streams generalizes to "stream partitions" because not every data stream being processed goes through Kafka. A partition is currently what determines the parallelism of both Kafka (the broker/server side) and stream processing applications that use the Kafka Streams API (the client side).
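To make the client side concrete, here is a minimal sketch of the settings involved. It assumes the standard Kafka Streams configuration keys `application.id`, `bootstrap.servers`, and `num.stream.threads` (the last one defaults to 1, so a single instance processes with one stream thread unless you raise it). The application id, the broker address, and the class and method names are placeholders made up for illustration. Only `java.util.Properties` is used so the snippet runs without the Kafka libraries; in a real application you would pass these properties to the `KafkaStreams` constructor.

```java
import java.util.Properties;

public class StreamsThreadConfigSketch {

    // Builds client-side settings for a Kafka Streams application.
    // The application id and broker address are placeholders, not values from the question.
    static Properties build(int streamThreads) {
        if (streamThreads < 1) {
            throw new IllegalArgumentException("streamThreads must be >= 1");
        }
        Properties props = new Properties();
        props.setProperty("application.id", "example-streams-app"); // hypothetical name
        props.setProperty("bootstrap.servers", "localhost:9092");   // hypothetical broker
        // Number of stream threads inside this one application instance.
        props.setProperty("num.stream.threads", Integer.toString(streamThreads));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build(4);
        // Show the thread count this instance would request.
        System.out.println("num.stream.threads = " + props.getProperty("num.stream.threads"));
    }
}
```

Raising the thread count only helps while there are enough partitions to keep the threads busy, since each partition is still processed by one thread at a time.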

If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.

Processing a partition will always be done by a single "thread" only, which ensures you are not running into concurrency issues. But...

If it's multi-threaded, I need to understand how this works and how to handle resources, for example whether SQL database connections should be shared across processing threads.

...because Kafka allows a topic to have many partitions, you get parallel processing. For example, if a topic has 100 partitions, then up to 100 stream tasks (or, somewhat over-simplified: up to 100 different machines each running an instance of your application) may process that topic in parallel. Again, every stream task would get exclusive access to 1 partition, which it would then process.
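The arithmetic behind the 100-partition example can be sketched as follows. This is a simplified model, not Kafka Streams code: it assumes one stream task per input partition (as described above) and an even spread of tasks over threads, whereas the real assignment is decided by Kafka Streams at runtime. The class and method names are hypothetical.

```java
public class StreamParallelismSketch {

    // Threads that can actually be busy: never more than the number of tasks,
    // and there is one task per input partition.
    static int busyThreads(int partitions, int instances, int threadsPerInstance) {
        if (partitions < 1 || instances < 1 || threadsPerInstance < 1) {
            throw new IllegalArgumentException("all arguments must be >= 1");
        }
        int totalThreads = instances * threadsPerInstance;
        return Math.min(partitions, totalThreads);
    }

    // Threads left without any task because there are not enough partitions.
    static int idleThreads(int partitions, int instances, int threadsPerInstance) {
        int totalThreads = instances * threadsPerInstance;
        return totalThreads - busyThreads(partitions, instances, threadsPerInstance);
    }

    // Tasks handled by the most loaded thread, ASSUMING an even spread of tasks.
    static int maxTasksPerThread(int partitions, int instances, int threadsPerInstance) {
        int busy = busyThreads(partitions, instances, threadsPerInstance);
        return (partitions + busy - 1) / busy; // ceiling division
    }

    public static void main(String[] args) {
        // 100 partitions processed by 10 instances with 4 threads each.
        System.out.println("busy threads: " + busyThreads(100, 10, 4));
        System.out.println("idle threads: " + idleThreads(100, 10, 4));
        System.out.println("max tasks per thread: " + maxTasksPerThread(100, 10, 4));
    }
}
```

Under this model, adding instances or threads beyond the partition count gains nothing: the extra threads simply have no partition to process.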

Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?

Kafka's stream processing engine is definitely recommended, and it is already being used in production for high-volume scenarios. Work on comparative benchmarking is still being done, but in many cases a Kafka Streams based application turns out to be faster. See the LINE engineering blog post "Applying Kafka Streams for internal message delivery pipeline", an article by LINE Corp, one of the largest social platforms in Asia (220M+ users), in which they describe how they use Kafka and the Kafka Streams API in production to process millions of events per second.
