Spark Streaming Kafka createDirectStream - Spark UI shows input event size as zero
Question
I have implemented Spark Streaming using createDirectStream. My Kafka producer sends several messages every second to a topic with two partitions.
On the Spark Streaming side, I read the Kafka messages every second and then window them with a 5-second window size and slide interval.
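For reference, the setup described above might look roughly like the following sketch of the Spark 1.5 / Kafka 0.8 direct-stream API. This is a hypothetical reconstruction, not the poster's actual code: the broker address, topic name, and app name are placeholders, and it needs a running Spark cluster and Kafka broker to execute.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamWindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamWindowSketch")
    // 1-second micro-batches, as described in the question
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder broker and topic names (the original post does not give them)
    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
    val topics      = Set("events") // the two-partition topic

    // Receiver-less direct stream (Kafka 0.8 API)
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    // 5-second window size and 5-second slide, as in the question
    val windowed = stream.map(_._2).window(Seconds(5), Seconds(5))
    windowed.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```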
The Kafka messages are processed correctly; I'm seeing the right computations and printed output.
But in the Spark web UI, under the Streaming section, the number of events per window is shown as zero. Please see this image:
I'm puzzled why it shows zero. Shouldn't it show the number of Kafka messages being fed into the Spark stream?
Update:
This issue seems to happen when I use the groupByKeyAndWindow() API. When I commented out this API call in my code, the Spark Streaming UI started reporting the Kafka event input size correctly.
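To make the windowing arithmetic above concrete, here is a small plain-Scala simulation, deliberately independent of Spark, of 1-second micro-batches being collected into 5-second windows. The batch contents and sizes are made up for illustration; with window size equal to slide interval, each window should simply contain the events of 5 consecutive batches.

```scala
// Plain-Scala sketch of tumbling-window batching (no Spark involved).
// Hypothetical input: 10 one-second micro-batches of 3 messages each.
object WindowSketch {
  // batches(i) = the events that arrived during second i
  val batches: Vector[Vector[String]] =
    Vector.tabulate(10)(i => Vector.fill(3)(s"msg-$i"))

  // Group `size` consecutive batches into one non-overlapping window,
  // mirroring window(Seconds(5), Seconds(5)) over 1-second batches.
  def windowed(size: Int): Vector[Vector[String]] =
    batches.grouped(size).map(_.flatten).toVector

  def main(args: Array[String]): Unit = {
    windowed(5).zipWithIndex.foreach { case (w, i) =>
      println(s"window $i: ${w.size} events")
    }
  }
}
```

With 10 batches of 3 messages and a window of 5 batches, this yields 2 windows of 15 events each, which is the per-window event count one would expect the Streaming UI to report.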
Any idea why this is so? Could this be a defect in Spark Streaming?
I'm using Cloudera CDH: 5.5.1, Spark: 1.5.0, Kafka: KAFKA-0.8.2.0-1.kafka1.4.0.p0.56
Answer
It seems that this metric is simply not recorded by the Spark Kafka library code.

Based on the Spark 2.3.1 source:

- Searching for "Input Size / Records" shows it is the value of stageData.inputBytes (StagePage.scala).
- Searching for StageData and inputBytes shows it is the value of metrics.inputMetrics.bytesRead (LiveEntity.scala).
- Searching for bytesRead shows it is set in HadoopRDD.scala, FileScanRDD.scala, and ShuffleSuite.scala, but not in any Kafka-related files.