Spark structured streaming with Kafka leads to only one batch (PySpark)


Problem Description

I have the following code and I'm wondering why it generates only one batch:

df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "IP").option("subscribe", "Topic").option("startingOffsets", "earliest").load()
# group-by on sliding windows (elided here), producing slidingWindowsDF
query = slidingWindowsDF.writeStream.queryName("bla").outputMode("complete").format("memory").start()

The application is launched with the following parameters:

spark.streaming.backpressure.initialRate 5
spark.streaming.backpressure.enabled True
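
These settings can be passed on the command line, for example (a minimal sketch; the application name app.py is a placeholder):

spark-submit \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.backpressure.initialRate=5 \
  app.py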

The Kafka topic contains around 11 million messages. I'm expecting it to generate at least two batches because of the initialRate parameter, but it generates only one. Can anyone tell me why Spark processes everything in a single batch?

I'm using Spark 2.2.1 and Kafka 1.0.

Recommended Answer

That is because the spark.streaming.backpressure.initialRate parameter is used only by the old Spark Streaming (DStreams) API, not by Structured Streaming.

Instead, use the maxOffsetsPerTrigger option: http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
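
For example, a minimal sketch (the limit of 100000 offsets per micro-batch is an arbitrary illustration; tune it to your throughput):

# Cap each micro-batch at 100,000 Kafka offsets (records),
# so ~11 million messages are spread over roughly 110 batches
# instead of being consumed in a single batch.
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "IP") \
    .option("subscribe", "Topic") \
    .option("startingOffsets", "earliest") \
    .option("maxOffsetsPerTrigger", 100000) \
    .load()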

BTW, see also this answer: How Spark Structured Streaming handles backpressure? Structured Streaming does not currently have full backpressure support.
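
To confirm that the stream is now split into multiple micro-batches, you can inspect the query's progress reports (a sketch; recentProgress holds a bounded list of the most recent batch updates):

# Each progress entry corresponds to one completed micro-batch.
for progress in query.recentProgress:
    print(progress["batchId"], progress["numInputRows"])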
