Limit Kafka batch size when using Spark Streaming

Question

Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming?

I am asking because the first batch I get has hundreds of millions of records, and it takes ages to process and checkpoint them.

Answer

I think your problem can be solved by Spark Streaming Backpressure.

Check spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate.

By default, spark.streaming.backpressure.initialRate is not set and spark.streaming.backpressure.enabled is disabled, so I suppose Spark will take as much as it can.
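
As a minimal sketch, enabling backpressure comes down to setting the property on the SparkConf before creating the streaming context (the application name and the 10-second batch interval below are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Enable the backpressure mechanism so Spark adapts the ingestion
// rate to the observed scheduling delay and processing time.
val conf = new SparkConf()
  .setAppName("kafka-stream-demo") // placeholder app name
  .set("spark.streaming.backpressure.enabled", "true")

// 10-second batch interval, chosen arbitrarily for this sketch.
val ssc = new StreamingContext(conf, Seconds(10))
```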

From the Apache Spark Kafka configuration:

spark.streaming.backpressure.enabled:

This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).

And since you want to control the first batch, or to be more specific, the number of messages in the first batch, I think you need spark.streaming.backpressure.initialRate.

spark.streaming.backpressure.initialRate:

This is the initial maximum receiving rate at which each receiver will receive data for the first batch when the backpressure mechanism is enabled.

This one is useful when your Spark job (or rather, your Spark workers) can process, say, 10,000 messages from Kafka, but the Kafka brokers hand your job 100,000 messages.
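
For that scenario, a hedged configuration sketch might look like the following; the figure of 10,000 records/second is the illustrative number from above, and initialRate only takes effect when backpressure is enabled:

```scala
import org.apache.spark.SparkConf

// With backpressure enabled, cap the very first batch: ingestion
// starts at no more than 10000 records/second, even if the brokers
// could hand the job 100000 messages right away. After the first
// batch, the backpressure estimator adjusts the rate dynamically.
val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "10000")
```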

You may also be interested in spark.streaming.kafka.maxRatePerPartition, as well as the research and suggestions for these properties on a real example by Jeroen van Wilgenburg on his blog.
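
For completeness, a sketch of combining backpressure with that hard per-partition cap; the partition count (4) and batch interval (10 seconds) in the comment are assumptions made up for the arithmetic:

```scala
import org.apache.spark.SparkConf

// maxRatePerPartition puts a hard ceiling on every batch, not just
// the first one. Assuming 4 Kafka partitions and a 10-second batch
// interval, a cap of 2500 records/second/partition bounds each batch
// at 2500 * 4 * 10 = 100000 records.
val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "2500")
```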
