Empty output for Watermarked Aggregation Query in Append Mode


Problem Description

I use Spark 2.2.0-rc1.

I've got a Kafka topic that I'm querying with a running watermarked aggregation (a 1-minute watermark), writing the results to the console sink in append output mode.

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// Each Kafka message is a JSON object with a single "time" field.
val schema = StructType(StructField("time", TimestampType) :: Nil)

val q = spark.
  readStream.
  format("kafka").
  option("kafka.bootstrap.servers", "localhost:9092").
  option("startingOffsets", "earliest").
  option("subscribe", "topic").
  load.
  // Parse the Kafka value as JSON and flatten it to a top-level "time" column.
  select(from_json(col("value").cast("string"), schema).as("value")).
  select("value.*").
  withWatermark("time", "1 minute").
  groupBy("time").
  count.
  writeStream.
  outputMode("append").
  format("console").
  start

I am pushing the following data into the Kafka topic:

{"time":"2017-06-07 10:01:00.000"}
{"time":"2017-06-07 10:02:00.000"}
{"time":"2017-06-07 10:03:00.000"}
{"time":"2017-06-07 10:04:00.000"}
{"time":"2017-06-07 10:05:00.000"}

I get the following output:

scala> -------------------------------------------
Batch: 0
-------------------------------------------
+----+-----+                                                                    
|time|count|
+----+-----+
+----+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+----+-----+                                                                    
|time|count|
+----+-----+
+----+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+----+-----+                                                                    
|time|count|
+----+-----+
+----+-----+

-------------------------------------------
Batch: 3
-------------------------------------------
+----+-----+                                                                    
|time|count|
+----+-----+
+----+-----+

-------------------------------------------
Batch: 4
-------------------------------------------
+----+-----+                                                                    
|time|count|
+----+-----+
+----+-----+

Is this the expected behavior?

Recommended Answer

Pushing more data to Kafka should trigger Spark to output something. The current behavior is entirely due to the internal implementation.

When you push some data, the StreamingQuery will generate a batch to run. When that batch finishes, it remembers the max event time seen in the batch. Then, in the next batch, because you are using append mode, the StreamingQuery uses the max event time and the watermark to evict old values from the StateStore and output them. Therefore you need to make sure at least two batches are generated in order to see any output.
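
As a concrete illustration (a sketch under the same assumptions as the producer snippet earlier, not part of the original answer): pushing one more event with a later timestamp gives Spark a new batch to run, which advances the max event time; once the watermark moves past the earlier groups, their finalized counts appear in the console sink instead of the empty tables shown above.

// Sketch: advance event time so the 1-minute watermark eventually passes the earlier groups.
// Assumes the `producer` from the earlier snippet is still open and the topic is the same.
producer.send(new ProducerRecord[String, String](
  "topic", """{"time":"2017-06-07 10:10:00.000"}"""))
producer.flush()
producer.close()

// Once this record has been processed and a further batch runs, the watermark is
// recomputed as (max event time - 1 minute), and the counts for timestamps that fall
// below it are evicted from state and emitted to the console.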
