What is/are the main difference(s) between Flink and Storm?


Question


Flink has been compared to Spark, which, as I see it, is the wrong comparison because it compares a windowed event processing system against micro-batching. Similarly, it does not make that much sense to me to compare Flink to Samza. In both cases it compares a real-time vs. a batched event processing strategy, even if at a smaller "scale" in the case of Samza. But I would like to know how Flink compares to Storm, which seems conceptually much more similar to it.


I have found this (Slide #4) documenting the main difference as "adjustable latency" for Flink. Another hint seems to be an article by SiliconANGLE that suggests that Flink integrates better into a Spark or HadoopMR world, but no actual details are mentioned or referenced. Finally, Fabian Hueske himself notes in an interview that "Compared to Apache Storm, the stream analysis functionality of Flink offers a high-level API and uses a more light-weight fault tolerance strategy to provide exactly-once processing guarantees."


All that is a bit sparse for me and I do not quite get the point. Can someone explain which problem(s) with stream processing in Storm are exactly solved by Flink? What is Hueske referring to with the API issues and their "more light-weight fault tolerance strategy"?

Answer


Disclaimer: I'm an Apache Flink committer and PMC member and only familiar with Storm's high-level design, not its internals.


Apache Flink is a framework for unified stream and batch processing. Flink's runtime natively supports both domains due to pipelined data transfers between parallel tasks, which include pipelined shuffles. Records are shipped from producing tasks to receiving tasks immediately (after being collected in a buffer for network transfer). Batch jobs can optionally be executed using blocking data transfers.


Apache Spark is a framework that also supports batch and stream processing. Flink's batch API looks quite similar to Spark's and addresses similar use cases, but differs in its internals. For streaming, the two systems follow very different approaches (mini-batches vs. streaming), which makes them suitable for different kinds of applications. I would say that comparing Spark and Flink is valid and useful; however, Spark is not the stream processing engine most similar to Flink.


Coming to the original question: Apache Storm is a data stream processor without batch capabilities. In fact, Flink's pipelined engine internally looks a bit similar to Storm, i.e., the interfaces of Flink's parallel tasks are similar to Storm's bolts. Storm and Flink have in common that they aim for low-latency stream processing via pipelined data transfers. However, Flink offers a more high-level API than Storm. Instead of implementing a bolt's functionality with one or more readers and collectors, Flink's DataStream API provides functions such as Map, GroupBy, Window, and Join. A lot of this functionality must be implemented manually when using Storm. Another difference is processing semantics: Storm guarantees at-least-once processing, while Flink provides exactly-once. The implementations that give these processing guarantees differ quite a bit. While Storm uses record-level acknowledgments, Flink uses a variant of the Chandy-Lamport algorithm. In a nutshell, data sources periodically inject markers into the data stream. Whenever an operator receives such a marker, it checkpoints its internal state. Once a marker has been received by all data sinks, the marker (and all records processed before it) are committed. In case of a failure, all source operators are reset to their state as of the last committed marker, and processing continues from there. This marker-checkpoint approach is more lightweight than Storm's record-level acknowledgments. This slide set and the corresponding talk discuss Flink's stream processing approach, including fault tolerance, checkpointing, and state handling.
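The marker mechanism described above can be illustrated with a deliberately simplified, single-operator sketch in Python. All names and structure here are illustrative only, not Flink's actual implementation:

```python
# Toy illustration of marker-based (Chandy-Lamport style) checkpointing.
# Single operator, single source -- not how Flink actually implements it.

MARKER = object()  # special marker record injected by the source

def source(records, checkpoint_interval):
    """Emit records, periodically injecting a marker into the stream."""
    for i, rec in enumerate(records, start=1):
        yield rec
        if i % checkpoint_interval == 0:
            yield MARKER

class SummingOperator:
    def __init__(self):
        self.state = 0        # running sum (the operator's internal state)
        self.snapshots = []   # committed checkpoints of the state

    def process(self, item):
        if item is MARKER:
            # On receiving a marker, checkpoint the current state.
            self.snapshots.append(self.state)
        else:
            self.state += item

    def recover(self):
        # On failure, reset to the last committed checkpoint.
        self.state = self.snapshots[-1] if self.snapshots else 0

op = SummingOperator()
for item in source([1, 2, 3, 4, 5, 6], checkpoint_interval=2):
    op.process(item)

print(op.state)      # 21 -- all six records processed
print(op.snapshots)  # [3, 10, 21] -- one checkpoint per marker
op.recover()
print(op.state)      # 21 -- state as of the last committed marker
```

Note that in this toy version the acknowledgment cost is one snapshot per marker, regardless of how many records flowed between markers, which is the essence of why the approach is cheaper than acknowledging every record individually.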


Storm also offers an exactly-once, high-level API called Trident. However, Trident is based on mini-batches and is hence more similar to Spark than to Flink.


Flink's "adjustable latency" refers to the way Flink sends records from one task to another. As I said before, Flink uses pipelined data transfers and forwards records as soon as they are produced. For efficiency, these records are collected in a buffer that is sent over the network once it is full or once a certain time threshold is reached. This threshold controls the latency of records, because it specifies the maximum amount of time a record will stay in a buffer without being sent to the next task. However, it cannot be used to give hard guarantees about the time it takes for a record to travel from entering to leaving a program, because that also depends on the processing time within tasks and the number of network transfers, among other things.
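The full-or-timeout flush policy can be sketched as follows. This is a toy Python model of the trade-off only, not Flink's network stack; the class and parameter names are made up for illustration:

```python
import time

class RecordBuffer:
    """Toy model of a network buffer that ships its contents downstream
    when it is full OR when a time threshold (the 'buffer timeout')
    has elapsed since the last flush. Illustrative only."""

    def __init__(self, capacity, timeout_seconds, send):
        self.capacity = capacity
        self.timeout = timeout_seconds
        self.send = send                       # downstream shipping function
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, record):
        self.buffer.append(record)
        now = time.monotonic()
        if len(self.buffer) >= self.capacity or now - self.last_flush >= self.timeout:
            self.flush(now)

    def flush(self, now=None):
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()
        self.last_flush = now or time.monotonic()

shipped = []
buf = RecordBuffer(capacity=4, timeout_seconds=0.05, send=shipped.append)
for r in range(5):
    buf.add(r)        # [0, 1, 2, 3] ships as soon as the buffer fills
time.sleep(0.06)
buf.add(5)            # timeout elapsed -> [4, 5] ships despite not being full
print(shipped)        # [[0, 1, 2, 3], [4, 5]]
```

A smaller timeout lowers per-record latency but ships more, smaller buffers (less throughput); a larger timeout does the opposite. In an actual Flink program the corresponding knob is `StreamExecutionEnvironment#setBufferTimeout(long)` in the Java API.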
