Kafka topic partitions to Spark streaming


Question

I have some use cases that I would like to have clarified, regarding Kafka topic partitioning -> Spark Streaming resource utilization.

I use Spark standalone mode, so the only settings I have are "total number of executors" and "executor memory". As far as I know, and according to the documentation, the way to introduce parallelism into Spark Streaming is to use a partitioned Kafka topic -> the RDD will have the same number of partitions as Kafka when I use the spark-kafka direct stream integration.
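
For reference, a minimal sketch of that direct-stream integration (spark-streaming-kafka-0-10 API); the broker address, topic name, and group id are assumptions:

```scala
// Minimal direct-stream sketch; broker, topic and group id are assumed values.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("KafkaToSparkStreaming")
val ssc  = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",           // assumed broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "demo-consumer-group",       // assumed group id
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("my-topic"), kafkaParams)  // assumed topic name
)

// With the direct stream there is a 1:1 mapping between Kafka partitions and
// the partitions of each batch RDD, which is easy to verify:
stream.foreachRDD { rdd =>
  println(s"Partitions in this batch RDD: ${rdd.getNumPartitions}")
}

ssc.start()
ssc.awaitTermination()
```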

So if I have 1 partition in the topic and 1 executor core, that core will read from Kafka sequentially.

What happens if I have:

  • 2 partitions in the topic and only 1 executor core? Will that core read first from one partition and then from the second one, so there will be no benefit in partitioning the topic?

  • 2 partitions in the topic and 2 cores? Will 1 executor core then read from 1 partition, and the second core from the second partition?

  • 1 Kafka partition and 2 executor cores?

Thanks.

Answer

The basic rule is that you can scale up to the number of Kafka partitions. If you set spark.executor.cores greater than the number of partitions, some of the threads will be idle. If it is less than the number of partitions, Spark will have threads read from one partition and then the other. So:

  1. 2 partitions, 1 executor: reads from one partition and then from the other. (I am not sure how Spark decides how much to read from each before switching.)

  2. 2p, 2c: parallel execution

  3. 1p, 2c: one thread is idle
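
These cases are driven by the standalone-mode settings the question mentions; a minimal sketch, where the master URL and the numeric values are assumptions chosen so that cores match a 2-partition topic:

```scala
// Standalone-mode sizing knobs; master URL and values are assumed.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("KafkaPartitionSizing")
  .setMaster("spark://master-host:7077")  // assumed standalone master URL
  .set("spark.cores.max", "2")            // caps total cores across all executors for the app
  .set("spark.executor.cores", "2")       // cores per executor; more cores than partitions leaves threads idle
  .set("spark.executor.memory", "2g")     // "executor memory" from the question
```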

For case #1, note that having more partitions than executors is OK, since it allows you to scale out later without having to re-partition. The trick is to make sure that your partition count is evenly divisible by the number of executors. Spark has to process all the partitions before passing data on to the next step in the pipeline, so if you have 'remainder' partitions this can slow down processing. For example, 5 partitions and 4 threads => processing takes the time of 2 partitions: 4 at once, then one thread running the 5th partition by itself.
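
To make that remainder arithmetic concrete, here is a tiny illustrative sketch (the helper name is just for illustration):

```scala
// Number of task "waves" a stage needs: ceil(partitions / cores).
// With 5 partitions and 4 cores, the 5th partition runs alone in a second
// wave, so the batch takes roughly twice as long as the 4-partition case.
def waves(partitions: Int, cores: Int): Int =
  math.ceil(partitions.toDouble / cores).toInt

println(waves(4, 4)) // 1
println(waves(5, 4)) // 2
println(waves(8, 4)) // 2 -- evenly divisible, no straggler wave
```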

Also note that you may see better processing throughput if you keep the number of partitions/RDDs the same throughout the pipeline, by explicitly setting the number of data partitions in functions like reduceByKey().
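
For example, building on the direct-stream sketch above (the partition count of 4 and the counting logic are assumptions), the partition count can be pinned through the shuffle like this:

```scala
// Keep the same number of partitions through the shuffle by passing it
// explicitly to reduceByKey; 4 is an assumed Kafka partition count.
val counts = stream
  .map(record => (record.value, 1))
  .reduceByKey((a, b) => a + b, 4)
counts.print()
```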

