Apache Beam:为什么全局窗口中聚合值的时间戳是9223371950454775? [英] Apache Beam: why is the timestamp of aggregate value in Global Window 9223371950454775?

查看:23
本文介绍了Apache Beam:为什么全局窗口中聚合值的时间戳是9223371950454775?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们从 Google Dataflow 1.9 迁移到 Apache Beam 0.6.我们注意到在应用 globalwindow 后时间戳的行为发生了变化.在 Google Dataflow 1.9 中,我们将在窗口/组合函数之后在 DoFn 中获得正确的时间戳.现在我们得到了一些巨大的时间戳值,例如9223371950454775, Apache Beam 版本中全局窗口的默认行为是否发生了变化?

We migrated from Google Dataflow 1.9 to Apache Beam 0.6. We are noticing a change in the behavior to the timestamps after applying the globalwindow. In Google Dataflow 1.9, we would get the correct timestamps in the DoFn after windowing/combine function. Now we get some huge value for the timestamp e.g. 9223371950454775, Did the default behavior for the globalwindow change in Apache Beam version?

input.apply(name(id, "Assign To Shard"), ParDo.of(new AssignToTest()))
      .apply(name(id, "Window"), Window
          .<KV<Long, ObjectNode >>into(new GlobalWindows())
          .triggering(Repeatedly.forever(
              AfterProcessingTime
                  .pastFirstElementInPane()
                  .plusDelayOf(Duration.standardMinutes(1))))
          .discardingFiredPanes())
      .apply(name(id, "Group By Shard"), GroupByKey.create())
      .appy(.....) }

推荐答案

TL;DR: 当你组合一堆带时间戳的值时,你需要为聚合的结果选择一个时间戳.这个输出时间戳有多个很好的答案.在 Dataflow 1.x 中,默认值是输入时间戳的最小值.根据我们在 Beam 中使用 1.x 的经验,默认值已更改为窗口的末尾.您可以通过将 .withTimestampCombiner(TimestampCombiner.EARLIEST) 添加到您的 Window 转换来恢复之前的行为.

TL;DR: When you are combining a bunch of timestamped values, you need to choose a timestamp for the result of the aggregation. There are multiple good answers for this output timestamp. In Dataflow 1.x the default was the minimum of the input timestamps. Based on our experience with 1.x in Beam the default was changed to the end of the window. You can restore the prior behavior by adding .withTimestampCombiner(TimestampCombiner.EARLIEST) to your Window transform.

我会拆开这个.让我们使用 @ 符号来配对一个值和它的时间戳.只关注一个键,你有时间戳值 v1@t1, v2@t2, ... 等等.我会坚持你的原始 示例GroupByKey 即使这也适用于组合值的其他方式.所以值的输出迭代是 [v1, v2, ...] 任意顺序.

I'll unpack this. Let's use the @ sign to pair up a value and its timestamp. Focusing on just one key, you have timestamped values v1@t1, v2@t2, ..., etc. I will stick with your example of a raw GroupByKey even though this also applies to other ways of combining the values. So the output iterable of values is [v1, v2, ...] in arbitrary order.

以下是时间戳的一些可能性:

Here are some possibilities for the timestamp:

  • min(t1, t2, ...)
  • max(t1, t2, ...)
  • 这些元素所在的窗口的结尾(忽略输入时间戳)

所有这些都是正确的.这些都可用作 Dataflow 1.x 中的 OutputTimeFn 和 Apache Beam 中的 TimestampCombiner 的选项.

All of these are correct. These are all available as options for your OutputTimeFn in Dataflow 1.x and TimestampCombiner in Apache Beam.

时间戳有不同的解释,它们对不同的事情有用.聚合值的输出时间控制下游水印.所以选择较早的时间戳更能保持下游水印,而较晚的时间戳允许它向前移动.

The timestamps have different interpretations and they are useful for different things. The output time of the aggregated value governs the downstream watermark. So choosing earlier timestamps holds the downstream watermark more, while later timestamps allows it to move ahead.

  • min(t1, t2, ...) 允许你解压迭代并重新输出 v1@t1
  • max(t1, t2, ...) 准确地模拟聚合值完全可用的逻辑时间.Max 往往是最昂贵的,原因与实施细节有关.
  • 窗口结束:
    • 模拟这个聚合代表所有窗口数据的事实
    • 非常容易理解
    • 允许下游水印尽可能快地推进
    • 效率极高
    • min(t1, t2, ...) allows you to unpack the iterable and re-output v1@t1
    • max(t1, t2, ...) accurately models the logical time that the aggregated value was fully available. Max does tend to be the most expensive, for reasons to do with implementation details.
    • end of the window:
      • models the fact that this aggregation represents all the data for the window
      • is very easy to understand
      • allows downstream watermarks to advance as fast as possible
      • is extremely efficient

      出于所有这些原因,我们将默认值从 min 切换到窗口结束时间.

      For all of these reasons, we switched the default from the min to end of window.

      在 Beam 中,您可以通过将 .withTimestampCombiner(TimestampCombiner.EARLIEST) 添加到您的 Window 转换来恢复先前的行为.在 Dataflow 1.x 中,您可以通过添加 .withOutputTimeFn(OutputTimeFns.outputAtEndOfWindow()) 来迁移到 Beam 的默认设置.

      In Beam, you can restore the prior behavior by adding .withTimestampCombiner(TimestampCombiner.EARLIEST) to your Window transform. In Dataflow 1.x you can migrate to Beam's defaults by adding .withOutputTimeFn(OutputTimeFns.outputAtEndOfWindow()).

      另一个技术性是用户定义的 OutputTimeFn 被删除并替换为 TimestampCombiner 枚举,所以只有这三个选择,而不是一个完整的 API 来编写你的自己的.

      Another technicality is that the user-defined OutputTimeFn is removed and replaced by the TimestampCombiner enum, so there are only these three choices, not a whole API to write your own.

      这篇关于Apache Beam:为什么全局窗口中聚合值的时间戳是9223371950454775?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆