How to display intermediate results in a windowed streaming ETL?


Question


We currently do a real-time aggregation of data in an event store. The idea is to visualize transaction data for multiple time ranges (monthly, weekly, daily, hourly) and for multiple nominal keys. We regularly have late data, so we need to account for that. Furthermore, the requirement is to display "running" results, that is, the value of the current window even before it is complete.


Currently we are using Kafka and Apache Storm (specifically Trident, i.e. micro-batches) to do this. Our architecture roughly looks like this:


(Apologies for my ugly pictures.) We use MongoDB as a key-value store to persist the state and then make it accessible (read-only) to a microservice that returns the current value it is queried for. There are multiple problems with that design:

  1. The maintenance burden of the code is very high
  2. It is difficult to guarantee exactly-once processing this way
  3. Updating the state after every aggregation obviously hurts performance, but it is fast enough.


We got the impression that better frameworks have become available since we started this project, namely Apache Flink and Kafka Streams (especially from a maintenance standpoint — Storm tends to be really verbose). Trying these out, it seemed like writing to a database, especially MongoDB, is not state of the art anymore. The standard use case I saw is state being persisted internally in RocksDB or in memory and then written back to Kafka once a window is complete.


Unfortunately, this makes it quite difficult to display intermediate results, and because the state is persisted internally, we would need the allowed lateness of events to be on the order of months or years. Is there a better solution for these requirements than hijacking the state of the real-time stream? Personally, I feel like this is a standard requirement, but I couldn't find a standard solution for it.

Answer


You could study Konstantin Knauf's Queryable Billing Demo as an example of how to approach some of the issues involved. The central relevant ideas used there are:

  1. Triggering the windows after every event, so that their results are continuously updated
  2. Making the results queryable (using Flink's Queryable State API)
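A framework-free sketch of idea 1 may help make it concrete. In Flink this is typically achieved with a trigger that fires on every element (e.g. `CountTrigger.of(1)` on a windowed stream); the Python below only simulates that behavior, and all names (`RunningWindowedSum`, `on_event`) are hypothetical, not part of any library:

```python
from collections import defaultdict

def window_start(ts, size):
    """Align a timestamp to the start of its tumbling window."""
    return ts - (ts % size)

class RunningWindowedSum:
    """Keeps per-(key, window) running sums and emits an updated result
    on every event -- a late event simply updates the window it belongs
    to, mimicking a window trigger that fires per element."""

    def __init__(self, size):
        self.size = size
        self.state = defaultdict(int)  # (key, window_start) -> running sum

    def on_event(self, key, ts, value):
        w = window_start(ts, self.size)
        self.state[(key, w)] += value
        # Emit the *running* result immediately, before the window closes.
        return (key, w, self.state[(key, w)])

agg = RunningWindowedSum(size=3600)       # hourly windows
print(agg.on_event("acct-1", 10, 5))      # ('acct-1', 0, 5)
print(agg.on_event("acct-1", 20, 7))      # ('acct-1', 0, 12)
print(agg.on_event("acct-1", 3700, 1))    # ('acct-1', 3600, 1)
print(agg.on_event("acct-1", 30, 2))      # late event updates the old window
```

Because state is only ever keyed by `(key, window_start)`, lateness is handled for free here; in a real Flink job, the allowed-lateness setting decides how long that state is retained.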


This was the subject of a Flink Forward conference talk; a video is available.


Rather than making the results queryable, you could instead stream out the window updates to a dashboard or database.
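The key to that approach is upsert semantics: each running update for a `(key, window)` pair overwrites the previous one, so the sink always holds the latest value. A minimal sketch, with a plain dict standing in for the dashboard store or database table (the names here are illustrative, not a real sink API):

```python
# Each window update is an upsert keyed by (key, window_start), so a
# dashboard or database always shows the latest running value.
def apply_update(sink, update):
    key, window, value = update
    sink[(key, window)] = value  # idempotent upsert: later updates overwrite

sink = {}
updates = [
    ("acct-1", 0, 5),      # first running result for the 00:00 window
    ("acct-1", 0, 12),     # updated running result for the same window
    ("acct-1", 3600, 1),   # first result for the 01:00 window
]
for u in updates:
    apply_update(sink, u)

print(sink[("acct-1", 0)])  # 12 -- the latest running value wins
```

Because the upsert is idempotent, replays after a failure do not corrupt the store, which sidesteps part of the exactly-once problem mentioned in the question.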


Also, note that you can cascade windows, meaning that the results of the hourly windows could be the input to the daily windows, etc.
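To illustrate the cascading idea with a toy sketch (again framework-free; in Flink the hourly window's output stream would simply be windowed a second time), hourly results become the input events of the daily aggregation:

```python
# Cascading windows: hourly aggregates feed the daily aggregation,
# so the daily level only sees pre-aggregated data from below.
HOUR, DAY = 3600, 86400

def window_start(ts, size):
    return ts - (ts % size)

def aggregate(events, size):
    """events: iterable of (ts, value) -> {window_start: sum}."""
    out = {}
    for ts, v in events:
        w = window_start(ts, size)
        out[w] = out.get(w, 0) + v
    return out

raw = [(100, 1), (4000, 2), (90000, 4)]
hourly = aggregate(raw, HOUR)           # {0: 1, 3600: 2, 90000: 4}
daily = aggregate(hourly.items(), DAY)  # hourly results feed the daily level
print(daily)                            # {0: 3, 86400: 4}
```

This works for sums because they are decomposable; aggregates like distinct counts or medians cannot be cascaded this naively.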
