在追加模式下,带水印的聚合查询的输出为空 [英] Empty output for Watermarked Aggregation Query in Append Mode
问题描述
我使用Spark 2.2.0-rc1.
I use Spark 2.2.0-rc1.
我有一个Kafka topic
,我要查询一个正在运行的带有水印的加水印聚合,并以append
输出模式提供给console
.
I've got a Kafka topic
which I'm querying a running watermarked aggregation, with a 1 minute
watermark, giving out to console
with append
output mode.
import org.apache.spark.sql.types._
val schema = StructType(StructField("time", TimestampType) :: Nil)
val q = spark.
readStream.
format("kafka").
option("kafka.bootstrap.servers", "localhost:9092").
option("startingOffsets", "earliest").
option("subscribe", "topic").
load.
select(from_json(col("value").cast("string"), schema).as("value"))
select("value.*").
withWatermark("time", "1 minute").
groupBy("time").
count.
writeStream.
outputMode("append").
format("console").
start
我正在Kafka topic
中推送以下数据:
I am pushing following data in Kafka topic
:
{"time":"2017-06-07 10:01:00.000"}
{"time":"2017-06-07 10:02:00.000"}
{"time":"2017-06-07 10:03:00.000"}
{"time":"2017-06-07 10:04:00.000"}
{"time":"2017-06-07 10:05:00.000"}
我得到以下输出:
scala> -------------------------------------------
Batch: 0
-------------------------------------------
+----+-----+
|time|count|
+----+-----+
+----+-----+
-------------------------------------------
Batch: 1
-------------------------------------------
+----+-----+
|time|count|
+----+-----+
+----+-----+
-------------------------------------------
Batch: 2
-------------------------------------------
+----+-----+
|time|count|
+----+-----+
+----+-----+
-------------------------------------------
Batch: 3
-------------------------------------------
+----+-----+
|time|count|
+----+-----+
+----+-----+
-------------------------------------------
Batch: 4
-------------------------------------------
+----+-----+
|time|count|
+----+-----+
+----+-----+
这是预期的行为吗?
推荐答案
将更多数据推送到Kafka应该会触发Spark输出一些内容.当前的行为完全是由于内部实现.
Pushing more data to Kafka should trigger Spark to output something. The current behavior is totally because of the internal implementation.
当您推送一些数据时,StreamingQuery将生成要运行的批处理.当该批次完成时,它将记住该批次中的最大事件时间.然后在下一批中
因为您使用的是append
模式,所以StreamingQuery将使用最大事件时间和水印从StateStore中逐出旧值并将其输出.因此,您需要确保至少生成两个批次才能看到输出.
When you push some data, StreamingQuery will generate a batch to run. When this batch finishes, it will remember the max event time in this batch. Then in the next batch,
because you are using append
mode, StreamingQuery will use the max event time and watermark to evict old values from StateStore and output it. Therefore you need to make sure generating at least two batches in order to see output.
这篇关于在追加模式下,带水印的聚合查询的输出为空的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!