How to display intermediate results in a windowed streaming-etl?


Problem description


We currently do real-time aggregation of data in an event store. The idea is to visualize transaction data for multiple time ranges (monthly, weekly, daily, hourly) and for multiple nominal keys. We regularly have late data, so we need to account for that. Furthermore, the requirement is to display "running" results, that is, the value of the current window even before it is complete.


Currently we are using Kafka and Apache Storm (specifically Trident, i.e. micro-batches) to do this. Our architecture roughly looks like this:


(Apologies for my ugly pictures.) We use MongoDB as a key-value store to persist the State, which is then accessible (read-only) through a microservice that returns the current value it was queried for. There are multiple problems with that design:

  1. The code is expensive to maintain
  2. Guaranteeing exactly-once processing this way is really difficult
  3. Updating the state after every aggregation obviously impacts performance, but it is fast enough.
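The update path described above (a per-event read-modify-write of running aggregates in an external key-value store) can be sketched roughly as follows; the key `shop-1`, the window sizes, and the in-memory dict standing in for MongoDB are illustrative assumptions, not the actual Trident code:

```python
from collections import defaultdict

# In-memory stand-in for the MongoDB key-value store that persists the State.
store = defaultdict(float)

def window_start(ts, size):
    """Align a timestamp (seconds) to the start of its window of `size` seconds."""
    return ts - ts % size

def on_event(key, ts, amount):
    """Update the running aggregate for every granularity this event falls into.
    Late events are handled naturally: they simply update an older window's value."""
    for size in (3600, 86400, 7 * 86400):  # hourly, daily, weekly
        store[(key, size, window_start(ts, size))] += amount

def query(key, size, ts):
    """Read-only lookup, as the microservice would do."""
    return store[(key, size, window_start(ts, size))]

on_event("shop-1", 7200, 10.0)  # event at t=7200s
on_event("shop-1", 7260, 5.0)   # same hour
on_event("shop-1", 3700, 2.0)   # late event: lands in the previous hour
```

The exactly-once difficulty from point 2 is visible here: every update is a read-modify-write against the store, so replaying a micro-batch after a failure would double-count unless the updates are made idempotent or transactional.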


We got the impression that better frameworks, namely Apache Flink or Kafka Streams, have become available since we started this project (especially from a maintenance standpoint - Storm tends to be really verbose). Trying these out, it seemed that writing to a database, especially MongoDB, is not state of the art anymore. The standard use case I saw is state being persisted internally in RocksDB or in memory and then written back to Kafka once a window is complete.


Unfortunately, this makes it quite difficult to display intermediate results, and because the state is persisted internally, we would need the allowed lateness of events to be on the order of months or years. Is there a better solution for this requirement than hijacking the state of the real-time stream? Personally, I feel this would be a standard requirement, but I couldn't find a standard solution for it.

Recommended answer


You could study Konstantin Knauf's Queryable Billing Demo as an example of how to approach some of the issues involved. The central, relevant ideas used there are:

  1. Triggering the windows after every event, so that their results are continuously updated
  2. Making the results queryable (using Flink's queryable state)

    This was the subject of a Flink Forward conference talk. Video is available.


    Rather than making the results queryable, you could instead stream out the window updates to a dashboard or database.
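A simplified, framework-free sketch of that idea - fire the window on every event and push the new running value to a sink - could look like this; the dict and the `updates` list are stand-ins for Flink's window state and the dashboard/database sink, and the key name is hypothetical:

```python
from collections import defaultdict

HOUR = 3600
windows = defaultdict(float)  # (key, window_start) -> running sum
updates = []                  # stand-in for the downstream dashboard/DB sink

def on_event(key, ts, amount):
    """Fire on every event: update the window's state and immediately emit
    the new running value downstream, instead of waiting for the window to
    complete. A late event re-fires the old window with a corrected value,
    which the consumer treats as an upsert."""
    start = ts - ts % HOUR
    windows[(key, start)] += amount
    updates.append((key, start, windows[(key, start)]))

on_event("shop-1", 3650, 10.0)
on_event("shop-1", 3700, 5.0)
on_event("shop-1", 100, 2.0)  # late event: the previous window is re-emitted
```

Because each emission carries the window's identity and its full current value, the sink only ever needs to overwrite one row per (key, window), regardless of how late an update arrives.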


    Also, note that you can cascade windows, meaning that the results of the hourly windows could be the input to the daily windows, etc.
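A minimal sketch of such a cascade, under the simplifying assumption that each hourly result is forwarded exactly once when the hour closes (with per-event firing, the cascade would additionally need retractions or upserts between levels):

```python
from collections import defaultdict

HOUR, DAY = 3600, 86400
hourly = defaultdict(float)  # (key, hour_start) -> sum of raw events
daily = defaultdict(float)   # (key, day_start)  -> sum of hourly results

def on_event(key, ts, amount):
    """Raw events are aggregated only at the finest granularity."""
    hourly[(key, ts - ts % HOUR)] += amount

def close_hour(key, hour_start):
    """When an hourly window closes, feed its result into the daily window,
    so the daily level never re-reads the raw events."""
    daily[(key, hour_start - hour_start % DAY)] += hourly[(key, hour_start)]

on_event("shop-1", 100, 1.0)
on_event("shop-1", 3700, 2.0)
close_hour("shop-1", 0)
close_hour("shop-1", 3600)
```

The same pattern extends upward: closed daily windows feed weekly windows, and so on, keeping the state at each coarser level proportional to the number of finer windows rather than the number of raw events.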

