Why does streaming Dataset fail with "Complete output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets... "?


Problem description

I use Spark 2.2.0 and get the following error with Spark Structured Streaming on Windows:

Complete output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark.

Recommended answer

Complete output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark

Streaming aggregations require that you tell the Spark Structured Streaming engine when to output the aggregation (per the so-called output mode), since data that belongs to an aggregation might be late and become available only after some time.

The "some time" part is the event lateness, expressed as a delay threshold: the watermark is the point in time that far behind the latest event time seen so far.

That's why you have to specify the watermark: it lets Spark drop/disregard any events that arrive too late and stop accumulating state that could eventually lead to an OutOfMemoryError or similar.

With that said, you should use the withWatermark operator on your streaming Dataset.

withWatermark: Defines an event time watermark for this Dataset. A watermark tracks a point in time before which we assume no more late data is going to arrive.
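
As a minimal sketch of what this can look like in PySpark (not the asker's actual query, which is not shown): the column name `eventTime`, the helper names, and the 10-minute / 5-minute durations below are assumptions chosen for illustration. The sketch uses Append output mode, where the watermark decides when a window's result is emitted.

```python
def build_windowed_counts(events, delay="10 minutes", window_size="5 minutes"):
    """Count events per event-time window, with a watermark bounding lateness.

    `events` is assumed to be a streaming DataFrame with a timestamp column
    named `eventTime` (a hypothetical name used only in this sketch).
    """
    # Imported inside the function so the sketch can be loaded without Spark.
    from pyspark.sql import functions as F

    return (
        events
        # Declare how late events may be, relative to the max event time seen.
        .withWatermark("eventTime", delay)
        # Aggregate on the same event-time column the watermark is defined on.
        .groupBy(F.window("eventTime", window_size))
        .count()
    )


def start_console_query(counts):
    """Start the streaming query, printing finalized windows to the console."""
    return (
        counts.writeStream
        .outputMode("append")
        .format("console")
        .start()
    )
```

The watermark has to be declared before the aggregation and on the same event-time column that the aggregation groups by; otherwise Spark cannot relate it to the windows.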

And quoting further...

Spark will use this watermark for several purposes:

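  • To know when a given time window aggregation can be finalized and thus can be emitted when using output modes that do not allow updates.
  • To minimize the amount of state that we need to keep for on-going aggregations, mapGroupsWithState and dropDuplicates operators.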
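
The first purpose can be illustrated with a small plain-Python model (not Spark code). It assumes tumbling windows and treats a window as final once its end is at or before the watermark; the function names and that exact boundary condition are assumptions of this sketch.

```python
from datetime import datetime, timedelta


def window_for(event_time, size):
    """Return the (start, end) tumbling window that contains event_time."""
    epoch = datetime(1970, 1, 1)
    offset = (event_time - epoch) % size  # time elapsed since the window start
    start = event_time - offset
    return start, start + size


def split_by_watermark(counts, watermark):
    """Split per-window counts into finalized and still-pending windows.

    A window is considered final here when its end is at or before the
    watermark (a simplifying assumption of this sketch).
    """
    finalized = {w: c for w, c in counts.items() if w[1] <= watermark}
    pending = {w: c for w, c in counts.items() if w[1] > watermark}
    return finalized, pending
```

In this model, only the finalized windows would be emitted in an output mode that does not allow updates, and their state could then be discarded, which is the second purpose in the list.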

The current watermark is computed by looking at the MAX(eventTime) seen across all of the partitions in the query minus a user specified delayThreshold. Due to the cost of coordinating this value across partitions, the actual watermark used is only guaranteed to be at least delayThreshold behind the actual event time. In some cases we may still process records that arrive more than delayThreshold late.
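
The computation described above can be sketched in plain Python. This is a simplified single-partition model, not Spark's implementation: the class name `WatermarkTracker` is hypothetical, and the sketch assumes each batch is checked against the watermark derived from earlier batches.

```python
from datetime import datetime, timedelta


class WatermarkTracker:
    """Track watermark = max(eventTime) seen so far - delayThreshold."""

    def __init__(self, delay_threshold):
        self.delay_threshold = delay_threshold
        self.max_event_time = None  # no events seen yet

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay_threshold

    def process_batch(self, event_times):
        """Return (accepted, dropped) event times for one batch."""
        current = self.watermark  # watermark computed from earlier batches
        accepted = [t for t in event_times if current is None or t >= current]
        dropped = [t for t in event_times if current is not None and t < current]
        # Advance the maximum event time; it never moves backwards.
        if event_times:
            batch_max = max(event_times)
            if self.max_event_time is None or batch_max > self.max_event_time:
                self.max_event_time = batch_max
        return accepted, dropped
```

Because the watermark only advances after a batch has been seen, the first batch is never filtered in this model, which loosely mirrors the quoted caveat that some records later than delayThreshold may still be processed.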

Check out Spark Structured Streaming's Handling Late Data and Watermarking.
