Kafka Streaming Concurrency?

Question

I have some basic Kafka Streaming code that reads records from one topic, does some processing, and outputs records to another topic.

How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.

If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.

If it's multi-threaded, I need to understand how this works and how to handle resources, for example how SQL database connections should be shared across processing threads.

Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?

Recommended answer

Update Oct 2020: I wrote a four-part blog series on Kafka fundamentals that I'd recommend reading for questions like these. For this question in particular, take a look at part 3, on processing fundamentals.

To your questions:

How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.

This is documented in detail at http://docs.confluent.io/current/streams/architecture.html#parallelism-model. I don't want to copy-paste it here verbatim, but IMHO the key concept to understand is the partition. Kafka has topic partitions, which Kafka Streams generalizes to "stream partitions" because not all data streams being processed go through Kafka. Partitions are currently what determine the parallelism of both Kafka (the broker/server side) and stream processing applications that use the Kafka Streams API (the client side).
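As a rough illustration of where threading shows up on the client side, here is a minimal sketch of a Kafka Streams configuration built with plain `java.util.Properties`. The keys `application.id`, `bootstrap.servers`, and `num.stream.threads` are Kafka Streams configuration names; the application id, the broker address, and the helper `buildConfig` are hypothetical values chosen for this example. In a real application these properties would be passed to the `KafkaStreams` constructor together with a topology, which is omitted here so that the sketch runs without the Kafka libraries.

```java
import java.util.Properties;

public class StreamsThreadingConfigSketch {

    // Hypothetical helper: builds the client-side settings for one application instance.
    static Properties buildConfig(int streamThreads) {
        Properties props = new Properties();
        // Hypothetical application id; instances sharing it divide the work among themselves.
        props.put("application.id", "example-streams-app");
        // Hypothetical broker address.
        props.put("bootstrap.servers", "localhost:9092");
        // Number of processing threads inside this one instance (Kafka Streams defaults to 1).
        props.put("num.stream.threads", Integer.toString(streamThreads));
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig(4);
        System.out.println("threads per instance = " + props.getProperty("num.stream.threads"));
    }
}
```

Raising `num.stream.threads` only helps up to the number of partitions being read; threads beyond that would have nothing to process.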

If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.

Processing a partition will always be done by a single "thread" only, which ensures you are not running into concurrency issues. But...

If it's multi-threaded, I need to understand how this works and how to handle resources, for example how SQL database connections should be shared across processing threads.

...because Kafka allows a topic to have many partitions, you get parallel processing. For example, if a topic has 100 partitions, then up to 100 stream tasks (or, somewhat over-simplified: up to 100 different machines each running an instance of your application) may process that topic in parallel. Again, every stream task would get exclusive access to 1 partition, which it would then process.
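The arithmetic behind the 100-partition example can be sketched in plain Java. This is a simplified illustration only, not the actual Kafka Streams task assignor: the class name, `assignTasks`, `busyThreads`, and the round-robin strategy are all assumptions made for this example. It treats each partition as one task and spreads the tasks over every thread of every application instance.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionParallelismSketch {

    // Hypothetical helper: one task per partition, distributed round-robin over all threads.
    // The real Kafka Streams assignment logic is more involved; this only shows the counting.
    static List<List<Integer>> assignTasks(int partitions, int instances, int threadsPerInstance) {
        int totalThreads = instances * threadsPerInstance;
        if (totalThreads <= 0) {
            throw new IllegalArgumentException("need at least one thread");
        }
        List<List<Integer>> perThread = new ArrayList<>();
        for (int t = 0; t < totalThreads; t++) {
            perThread.add(new ArrayList<>());
        }
        for (int task = 0; task < partitions; task++) {
            // Each task (and so each partition) is owned by exactly one thread.
            perThread.get(task % totalThreads).add(task);
        }
        return perThread;
    }

    // Counts the threads that received at least one task.
    static long busyThreads(List<List<Integer>> assignment) {
        return assignment.stream().filter(tasks -> !tasks.isEmpty()).count();
    }

    public static void main(String[] args) {
        // 100 partitions over 4 instances with 5 threads each: every thread owns 5 tasks.
        List<List<Integer>> small = assignTasks(100, 4, 5);
        System.out.println("threads=" + small.size() + ", tasks on first thread=" + small.get(0).size());

        // 100 partitions over 150 single-threaded instances: only 100 threads get work.
        List<List<Integer>> large = assignTasks(100, 150, 1);
        System.out.println("busy threads=" + busyThreads(large) + " of " + large.size());
    }
}
```

The second case shows the upper bound from the answer: with 100 partitions, adding instances or threads beyond 100 leaves the extra ones idle. Because each task is owned by a single thread, a resource such as a database connection that is created and used inside one task is not touched concurrently by another.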

Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?

Kafka's stream processing engine is definitely recommended, and it is used in practice for high-volume scenarios. Work on comparative benchmarking is still ongoing, but in many cases a Kafka Streams based application turns out to be faster. See the LINE engineering blog article "Applying Kafka Streams for internal message delivery pipeline": LINE Corp, one of the largest social platforms in Asia (220M+ users), describes how it uses Kafka and the Kafka Streams API in production to process millions of events per second.
