Limit Kafka batch size when using Spark Streaming


Question

Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming?

I am asking because the first batch I get has hundreds of millions of records, and it takes ages to process and checkpoint them.

Answer

I think your problem can be solved by Spark Streaming backpressure.

Check spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate.

By default, spark.streaming.backpressure.initialRate is not set and spark.streaming.backpressure.enabled is disabled, so I suppose Spark will take as much as it can.

From the Apache Spark Kafka configuration documentation:

spark.streaming.backpressure.enabled:

This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).
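As a minimal sketch of turning this property on (assuming PySpark; the app name and 5-second batch interval are illustrative choices, while the property key is the one quoted above):

```python
# Sketch: enabling Spark Streaming backpressure (assumes PySpark is available).
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("kafka-batch-limit")  # illustrative app name
        # Let Spark adapt the ingestion rate to the observed processing rate:
        .set("spark.streaming.backpressure.enabled", "true"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=5)  # 5-second batches (illustrative)
```

The same key can equally be passed on the command line via `spark-submit --conf spark.streaming.backpressure.enabled=true`.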

And since you want to control the first batch, or to be more specific the number of messages in the first batch, I think you need spark.streaming.backpressure.initialRate.

spark.streaming.backpressure.initialRate:

This is the initial maximum receiving rate at which each receiver will receive data for the first batch when the backpressure mechanism is enabled.
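A hedged sketch of capping that first batch (assuming PySpark; the rate of 1000 records/sec per receiver is an assumed illustrative value, not a recommendation):

```python
# Sketch: capping the first batch before backpressure has any feedback
# (assumes PySpark; the rate value is an illustrative assumption).
from pyspark import SparkConf

conf = (SparkConf()
        .setAppName("kafka-batch-limit")  # illustrative app name
        .set("spark.streaming.backpressure.enabled", "true")
        # Max rate (records/sec per receiver) used only for the first batch,
        # before the backpressure algorithm has observed any processing times:
        .set("spark.streaming.backpressure.initialRate", "1000"))
```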

This one is good when your Spark job (or rather, all your Spark workers together) is able to process, say, 10000 messages from Kafka, but the Kafka brokers give your job 100000 messages.
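To get a feel for the numbers with the direct Kafka stream: spark.streaming.kafka.maxRatePerPartition caps each partition's rate, so the per-batch ceiling is rate × partitions × batch interval. A quick illustrative calculation (all three values are assumptions chosen for the example):

```python
def max_records_per_batch(max_rate_per_partition, num_partitions, batch_interval_s):
    """Upper bound on records in one batch when
    spark.streaming.kafka.maxRatePerPartition is set (direct stream)."""
    return max_rate_per_partition * num_partitions * batch_interval_s

# e.g. 200 records/sec/partition, 10 partitions, 5-second batches:
print(max_records_per_batch(200, 10, 5))  # → 10000
```

So with those assumed values, no batch can exceed 10000 records, regardless of how far behind the job is on the topic.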

Maybe you will also be interested in checking spark.streaming.kafka.maxRatePerPartition, and in some research and suggestions for these properties on a real example by Jeroen van Wilgenburg on his blog.
