Streaming Data Processing and nanosecond time resolution


Question

I'm just starting into the topic of real-time stream data processing frameworks, and I have a question to which I as of yet could not find any conclusive answer:

Do the usual suspects (Apache's Spark, Kafka, Storm, Flink etc.) support processing data with an event time resolution of nanoseconds (or even picoseconds)?

Most people and documentation talk about millisecond or microsecond resolution, but I was unable to find a definite answer as to whether higher resolution is possible or a problem. The only framework I can infer to have this capability is influxData's Kapacitor, since their TSDB InfluxDB seems to store timestamps at nanosecond resolution.

Can anybody here offer some insights on this, or even some informed facts? Are there alternative solutions/frameworks offering this capability?

Any pointers would be greatly appreciated!

Thanks and regards,

Simon

Background of my question: I'm working in an environment with quite a number of proprietary implementations for data storage and processing, and I'm currently thinking about some reorganization/optimization. We are doing plasma physics experiments with a lot of different diagnostic/measurement systems at various sampling rates, now up to above a gigasample per second. The one common fact/assumption in our systems is that each sample has a recorded event time at nanosecond resolution. When adopting an established stream (or batch) processing framework, we would have to keep this timestamp resolution, or go even finer, since we recently breached the 1 Gsps threshold with some systems. Hence my question.

Answer

In case this is not clear, you should be aware of the difference between event time and processing time:

event time - time of generation of the event at the source

processing time - time of event execution within processing engine

src: Flink documentation
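The distinction can be made concrete with a small sketch (plain Java, no Flink dependency; the class and field names here are mine, not from any framework):

```java
// A sample carries its own event time, stamped at the source; processing time
// is whatever the engine's clock reads when the sample is finally handled.
class TimedSample {
    final long eventTimeNanos;    // assigned by the source, e.g. a DAQ hardware clock
    long processingTimeMillis;    // assigned by the processing engine, later

    TimedSample(long eventTimeNanos) {
        this.eventTimeNanos = eventTimeNanos;
    }

    void process() {
        // In a real pipeline this happens after transport and queueing delays,
        // so processing time always lags event time by some variable amount.
        processingTimeMillis = System.currentTimeMillis();
    }
}
```

The gap between the two clocks is exactly what watermarks have to account for.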

AFAIK Storm doesn't support event time and Spark has limited support. That leaves Kafka Streams and Flink for consideration.

Flink uses the long type for timestamps. The docs mention that this value represents milliseconds since 1970-01-01T00:00:00Z, but AFAIK, when you use the event time characteristic, the only measure of progress is the event timestamps themselves. So, if you can fit your values into the long range, it should be doable.
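As a sanity check on that "fit into the long range" condition: nanoseconds since the epoch only overflow a signed 64-bit long in the year 2262, so nanosecond event times are representable for any realistic experiment date. A quick sketch (plain Java, no Flink dependency; the class name is mine):

```java
import java.time.Instant;

class NanoEpoch {
    // Latest instant representable as nanoseconds-since-epoch in a signed long.
    static Instant maxNanoInstant() {
        return Instant.ofEpochSecond(0, Long.MAX_VALUE);
    }

    // Encode an Instant as a nanosecond epoch timestamp; throws on overflow.
    static long toEpochNanos(Instant t) {
        return Math.addExact(
                Math.multiplyExact(t.getEpochSecond(), 1_000_000_000L),
                t.getNano());
    }

    public static void main(String[] args) {
        // Long.MAX_VALUE nanoseconds after the epoch:
        System.out.println(maxNanoInstant()); // 2262-04-11T23:47:16.854775807Z
    }
}
```

(InfluxDB and pandas' datetime64[ns] hit the same 2262 limit, for the same reason.)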

In general watermarks (based on timestamps) are used for measuring the progress of event time in windows, triggers etc. So, if you use:

  • AssignerWithPeriodicWatermarks then a new watermark is emitted at the interval defined in config (autowatermark interval) in the processing time domain - even when the event time characteristic is used. For details see e.g. the org.apache.flink.streaming.runtime.operators.TimestampsAndPeriodicWatermarksOperator#open() method, where a timer in processing time is registered. So, if the autowatermark interval is set to 500ms, then every 500ms of processing time (as taken from System.currentTimeMillis()) a new watermark is emitted, but the timestamp of the watermark is based on the timestamps from events.

  • AssignerWithPunctuatedWatermarks then the best description can be found in the docs for org.apache.flink.streaming.api.datastream.DataStream#assignTimestampsAndWatermarks(org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks<T>):

Assigns timestamps to the elements in the data stream and creates watermarks to signal event time progress based on the elements themselves.

This method creates watermarks based purely on stream elements. For each element that is handled via AssignerWithPunctuatedWatermarks#extractTimestamp(Object, long), the AssignerWithPunctuatedWatermarks#checkAndGetNextWatermark(Object, long) method is called, and a new watermark is emitted if the returned watermark value is non-negative and greater than the previous watermark.

This method is useful when the data stream embeds watermark elements, or certain elements carry a marker that can be used to determine the current event time watermark. This operation gives the programmer full control over the watermark generation. Users should be aware that too aggressive watermark generation (i.e., generating hundreds of watermarks every second) can cost some performance.
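Both assigner styles boil down to a few lines of logic. The sketch below is not the real Flink interface, just a standalone illustration of the two mechanisms, with timestamps treated as nanoseconds since the epoch (class and method names are mine):

```java
import java.util.OptionalLong;
import java.util.concurrent.atomic.AtomicLong;

// Periodic style: a processing-time timer reads currentWatermark() at a fixed
// interval (e.g. every 500 ms); the returned value is an event-time timestamp.
class PeriodicWatermarkSketch {
    private final AtomicLong maxEventTs = new AtomicLong(Long.MIN_VALUE);
    private final long maxOutOfOrdernessNanos;

    PeriodicWatermarkSketch(long maxOutOfOrdernessNanos) {
        this.maxOutOfOrdernessNanos = maxOutOfOrdernessNanos;
    }

    void onElement(long eventTsNanos) {   // called for every element
        maxEventTs.accumulateAndGet(eventTsNanos, Math::max);
    }

    long currentWatermark() {             // called by the processing-time timer
        return maxEventTs.get() - maxOutOfOrdernessNanos;
    }
}

// Punctuated style: inspect each element and emit a watermark only when the
// element is a marker that advances event time.
class PunctuatedWatermarkSketch {
    static final class Event {
        final long tsNanos;
        final boolean isMarker;
        Event(long tsNanos, boolean isMarker) {
            this.tsNanos = tsNanos;
            this.isMarker = isMarker;
        }
    }

    private long lastWatermark = Long.MIN_VALUE;

    OptionalLong onElement(Event e) {
        if (e.isMarker && e.tsNanos > lastWatermark) {
            lastWatermark = e.tsNanos;
            return OptionalLong.of(lastWatermark);
        }
        return OptionalLong.empty();      // no watermark for this element
    }
}
```

In the periodic variant the timer fires in processing time but the emitted value is event time; in the punctuated variant watermark emission is driven entirely by the elements themselves. Nothing in either mechanism cares whether the long values are milliseconds or nanoseconds.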

To understand how watermarks work, this read is highly recommended: Tyler Akidau on Streaming 102
