Kafka topic partitions to Spark streaming


Question

I have some use cases I would like clarified, concerning Kafka topic partitioning -> Spark Streaming resource utilization.

I use Spark standalone mode, so the only settings I have are "total number of executors" and "executor memory". As far as I know, and according to the documentation, the way to introduce parallelism into Spark Streaming is to use a partitioned Kafka topic: when I use the spark-kafka direct stream integration, each RDD will have the same number of partitions as the Kafka topic.
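Below is a minimal sketch of the direct stream integration referred to here, using the spark-streaming-kafka-0-10 API; the broker address, topic name, group id, and batch interval are illustrative assumptions. It just prints the per-batch partition count, which equals the number of Kafka partitions of the subscribed topic.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-partitions-demo")
val ssc = new StreamingContext(conf, Seconds(5))    // assumed 5-second batch interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",          // assumed broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "demo-group",                       // assumed consumer group
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("my-topic"), kafkaParams)  // assumed topic name
)

// Each batch RDD has exactly one Spark partition per Kafka partition of "my-topic".
stream.foreachRDD { rdd =>
  println(s"Partitions in this batch: ${rdd.getNumPartitions}")
}

ssc.start()
ssc.awaitTermination()
```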

So if I have 1 partition in the topic and 1 executor core, that core will read from Kafka sequentially.

What happens if I have:

  • 2 partitions in the topic and only 1 executor core? Will that core read first from one partition and then from the second one, so there is no benefit in partitioning the topic?

  • 2 partitions in the topic and 2 cores? Will 1 executor core then read from 1 partition, and the second core from the second partition?

  • 1 Kafka partition and 2 executor cores?

Thank you.

Answer

The basic rule is that you can scale up to the number of Kafka partitions. If you set spark.executor.cores greater than the number of partitions, some of the threads will be idle. If it is less than the number of partitions, Spark will have threads read from one partition and then the other. So:

  1. 2 partitions, 1 executor: reads from one partition, then the other. (I am not sure how Spark decides how much to read from each before switching.)

  2. 2p, 2c: parallel execution

  3. 1p, 2c: one thread is idle
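To make the core-to-partition mapping concrete, here is a minimal standalone-mode configuration sketch; the master URL, memory size, and partition count are assumptions for illustration, not values from the question.

```scala
import org.apache.spark.SparkConf

// Cap the application at as many total cores as the topic has Kafka partitions,
// so each core can own one partition per batch and none sit idle.
val kafkaPartitions = 2                                // assumed partition count of the topic

val conf = new SparkConf()
  .setAppName("kafka-parallelism-demo")
  .setMaster("spark://master:7077")                    // assumed standalone master URL
  .set("spark.cores.max", kafkaPartitions.toString)    // total cores across all executors (standalone mode)
  .set("spark.executor.memory", "2g")                  // the "executor memory" setting from the question
```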

For case #1, note that having more partitions than executors is OK, since it allows you to scale out later without having to re-partition. The trick is to make sure that the number of partitions is evenly divisible by the number of executors. Spark has to process all of the partitions before passing data on to the next step in the pipeline, so if you have 'remainder' partitions, this can slow down processing. For example, 5 partitions and 4 threads => processing takes the time of 2 waves of partitions: 4 at once, then one thread running the 5th partition by itself.
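A rough way to see the 'remainder partitions' effect, assuming similar work per partition: batch time scales with the number of task waves rather than the number of partitions.

```scala
// Number of waves needed to process `partitions` tasks with `cores` parallel slots.
def waves(partitions: Int, cores: Int): Int =
  math.ceil(partitions.toDouble / cores).toInt

waves(4, 4) // 1 wave: every core busy
waves(5, 4) // 2 waves: the 5th partition runs alone while 3 cores sit idle
waves(8, 4) // 2 waves: evenly divisible, no idle cores in the second wave
```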

Also note that you may see better processing throughput if you keep the number of partitions/RDDs the same throughout the pipeline, by explicitly setting the number of data partitions in functions like reduceByKey().
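For example, continuing the direct-stream sketch above (the partition count and the choice of key are assumptions), the numPartitions argument of reduceByKey() can be pinned to the Kafka partition count so the shuffle does not change the parallelism:

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream

val kafkaPartitions = 4  // assumed number of Kafka partitions in the topic

// Count occurrences of each record value, keeping the same partition count through the shuffle.
def valueCounts(stream: InputDStream[ConsumerRecord[String, String]]) =
  stream
    .map(record => (record.value(), 1L))
    .reduceByKey(_ + _, kafkaPartitions)
```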
