Streaming Data Processing and nanosecond time resolution


Question

I'm just starting into the topic of real-time stream data processing frameworks, and I have a question to which I as of yet could not find any conclusive answer:

Do the usual suspects (Apache's Spark, Kafka, Storm, Flink etc.) support processing data with an event time resolution of nanoseconds (or even picoseconds)?

Most people and documentation talk about millisecond or microsecond resolution, but I was unable to find a definitive answer on whether higher resolution is possible or problematic. The only framework I can infer to have this capability is influxData's Kapacitor, as their TSDB InfluxDB appears to store timestamps at nanosecond resolution.

Can anybody here offer some insights on this or even some informed facts? Alternative solutions/frameworks offering this capability?

Anything would be much appreciated!

Thanks and regards,

Simon

Background of my question: I'm working in an environment with quite a number of proprietary implementations for data storage and processing, and I am currently thinking about some reorganization/optimization. We are doing plasma physics experiments with many different diagnostic/measurement systems at various sampling rates, now up to "above giga-samples per second". The one common fact/assumption in our systems is that each sample has a recorded event time at nanosecond resolution. When trying to employ an established stream (or batch) processing framework, we would have to keep this timestamp resolution, or go even further, since we recently breached the 1 Gsps threshold with some systems. Hence my question.
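To make the resolution requirement concrete, here is a small back-of-the-envelope sketch (only the 1 Gsps figure is taken from the background above; the rest is simple arithmetic): at one giga-sample per second, consecutive samples are 1 ns apart, so a millisecond-resolution timestamp would collapse a million samples onto the same value.

```java
public class SamplingResolution {
    public static void main(String[] args) {
        long samplesPerSecond = 1_000_000_000L; // 1 Gsps, as in the experiments above
        long nanosPerSecond   = 1_000_000_000L;

        // Spacing between consecutive samples, in nanoseconds.
        long nsBetweenSamples = nanosPerSecond / samplesPerSecond; // 1 ns

        // How many samples would share a single millisecond-resolution timestamp.
        long samplesPerMilli = samplesPerSecond / 1_000L; // 1,000,000

        System.out.println(nsBetweenSamples + " ns between samples");
        System.out.println(samplesPerMilli + " samples per millisecond timestamp");
    }
}
```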

Answer

In case this is not clear, you should be aware of the difference between event time and processing time:

  • event time - time of generation of the event at the source

  • processing time - time of event execution within the processing engine

src: Flink documentation

AFAIK Storm doesn't support event time and Spark has limited support. That leaves Kafka Streams and Flink for consideration.

Flink uses the long type for timestamps. The docs mention that this value represents milliseconds since 1970-01-01T00:00:00Z, but AFAIK, when you use the event-time characteristic, the only measure of progress is the event timestamps themselves. So, if you can fit your values into the long range, it should be doable.
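As a rough sanity check (a sketch, not a statement from the Flink docs): a signed 64-bit long interpreted as nanoseconds since the Unix epoch only overflows after roughly 292 years, i.e. around the year 2262, so nanosecond event times fit comfortably:

```java
public class NanoTimestampRange {
    public static void main(String[] args) {
        // Long.MAX_VALUE nanoseconds is about 292 years, so epoch-nanos in a
        // signed 64-bit long overflow around the year 2262.
        long nanosPerYear = 365L * 24 * 60 * 60 * 1_000_000_000L;
        long maxYears = Long.MAX_VALUE / nanosPerYear;
        System.out.println("long holds ~" + maxYears + " years of nanoseconds");

        // Example: 2024-01-01T00:00:00Z expressed as epoch nanoseconds.
        long epochSeconds = 1_704_067_200L;
        long epochNanos = epochSeconds * 1_000_000_000L;
        System.out.println("2024-01-01 as epoch-nanos: " + epochNanos);
    }
}
```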

In general, watermarks (based on timestamps) are used for measuring the progress of event time in windows, triggers etc. So, if you use:

  • AssignerWithPeriodicWatermarks, then a new watermark is emitted at intervals defined in the configuration (auto-watermark interval) in the processing-time domain - even when the event-time characteristic is used. For details see e.g. the org.apache.flink.streaming.runtime.operators.TimestampsAndPeriodicWatermarksOperator#open() method, where a timer in processing time is registered. So, if the auto-watermark interval is set to 500 ms, then every 500 ms of processing time (as taken from System.currentTimeMillis()) a new watermark is emitted, but the timestamp of the watermark is based on the timestamps from the events.

  • AssignerWithPunctuatedWatermarks, then the best description can be found in the docs for org.apache.flink.streaming.api.datastream.DataStream#assignTimestampsAndWatermarks(org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks<T>):

Assigns timestamps to the elements in the data stream and creates watermarks to signal event time progress based on the elements themselves.

This method creates watermarks based purely on stream elements. For each element that is handled via AssignerWithPunctuatedWatermarks#extractTimestamp(Object, long), the AssignerWithPunctuatedWatermarks#checkAndGetNextWatermark(Object, long) method is called, and a new watermark is emitted, if the returned watermark value is non-negative and greater than the previous watermark.

This method is useful when the data stream embeds watermark elements, or certain elements carry a marker that can be used to determine the current event time watermark. This operation gives the programmer full control over the watermark generation. Users should be aware that too aggressive watermark generation (i.e., generating hundreds of watermarks every second) can cost some performance.
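Both assigner styles can be illustrated with a framework-free sketch. The classes below are simplified stand-ins for Flink's AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks, not the real interfaces; note that the timestamps are opaque longs, so epoch-nanoseconds work the same way as epoch-milliseconds:

```java
import java.util.ArrayList;
import java.util.List;

// Framework-free sketches of the two watermarking styles described above.
public class WatermarkSketches {

    // Periodic style: track the maximum event timestamp seen; a processing-time
    // timer (simulated in main) periodically emits max minus an out-of-orderness bound.
    static class PeriodicAssigner {
        private final long maxOutOfOrderness;
        private long maxTimestampSeen = Long.MIN_VALUE;

        PeriodicAssigner(long maxOutOfOrderness) {
            this.maxOutOfOrderness = maxOutOfOrderness;
        }

        void onElement(long eventTimestamp) {        // analogous to extractTimestamp
            maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestamp);
        }

        long currentWatermark() {                    // analogous to getCurrentWatermark
            return maxTimestampSeen - maxOutOfOrderness;
        }
    }

    // Punctuated style: some elements carry a marker; a new watermark is emitted
    // only when it exceeds the previously emitted one (per the quoted docs).
    static class PunctuatedAssigner {
        private long lastWatermark = -1L;

        // Returns the emitted watermark, or -1 if none. ~ checkAndGetNextWatermark
        long onElement(long eventTimestamp, boolean carriesMarker) {
            if (carriesMarker && eventTimestamp > lastWatermark) {
                lastWatermark = eventTimestamp;
                return eventTimestamp;
            }
            return -1L;
        }
    }

    public static void main(String[] args) {
        PeriodicAssigner periodic = new PeriodicAssigner(1_000L); // 1000 ns slack
        for (long ts : new long[]{5_000L, 7_500L, 6_000L}) periodic.onElement(ts);
        System.out.println("periodic watermark: " + periodic.currentWatermark()); // 6500

        PunctuatedAssigner punctuated = new PunctuatedAssigner();
        List<Long> emitted = new ArrayList<>();
        for (long[] e : new long[][]{{10, 0}, {20, 1}, {15, 1}, {30, 1}}) {
            long w = punctuated.onElement(e[0], e[1] == 1);
            if (w >= 0) emitted.add(w);
        }
        System.out.println("punctuated watermarks: " + emitted); // [20, 30]
    }
}
```

In the periodic variant the watermark advances only when the (simulated) timer reads the tracked maximum; in the punctuated variant the late marker (timestamp 15) is ignored because it does not exceed the previously emitted watermark (20).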

To understand how watermarks work, this read is highly recommended: Tyler Akidau on Streaming 102
