What is the difference between mini-batch and real-time streaming in practice (not in theory)?

Question

What is the difference between mini-batch and real-time streaming in practice (not in theory)? In theory, I understand that mini-batch processing collects records over a given time frame and processes them together, whereas real-time streaming acts on each record as it arrives. My biggest question is: why not run mini-batches with an epsilon time frame (say, one millisecond)? I would like to understand why one approach would be more effective than the other.

I recently came across an example where mini-batch processing (Apache Spark) was used for fraud detection and real-time streaming (Apache Flink) for fraud prevention. Someone also commented that mini-batches would not be an effective solution for fraud prevention, since the goal is to stop a transaction while it is happening. Why would mini-batch (Spark) be less effective here? Why is it not effective to run mini-batches with 1 millisecond latency? Batching is used everywhere, including the OS and the kernel TCP/IP stack, where data headed to the disk or network is buffered. So what is the convincing argument that one is more effective than the other?

Recommended answer

Disclaimer: I'm a committer and PMC member of Apache Flink. I'm familiar with the overall design of Spark Streaming but do not know its internals in detail.

The mini-batch stream processing model as implemented by Spark Streaming works as follows:

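  • The records of a stream are collected in a buffer (the mini-batch).
  • Periodically, the collected records are processed using a regular Spark job. This means that for each mini-batch, a complete distributed batch job is scheduled and executed.
  • While the job is running, the records for the next batch are collected.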
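
The buffering step in the list above can be sketched in plain Python. This is only an illustration of the model, not Spark code: the function names `micro_batches` and `buffering_delays`, the timestamps, and the 500 ms interval are all assumptions made for the example, and the "job" is reduced to the moment the batch is handed over.

```python
from collections import defaultdict

def micro_batches(records, interval_ms):
    """Group (timestamp_ms, value) records into fixed-interval batches.

    A batch is handed to the job only when its interval closes, so each
    record waits until the end of the interval in which it arrived.
    """
    batches = defaultdict(list)
    for ts, value in records:
        batch_id = ts // interval_ms  # index of the interval the record falls into
        batches[batch_id].append((ts, value))
    result = []
    for batch_id in sorted(batches):
        close_time = (batch_id + 1) * interval_ms  # when the batch job could start
        result.append((close_time, batches[batch_id]))
    return result

def buffering_delays(records, interval_ms):
    """Time each record spends in the buffer before its batch job starts."""
    delays = []
    for close_time, batch in micro_batches(records, interval_ms):
        delays.extend(close_time - ts for ts, _ in batch)
    return delays

# Hypothetical arrival times in milliseconds.
records = [(0, "a"), (120, "b"), (480, "c"), (510, "d"), (990, "e")]
for close_time, batch in micro_batches(records, 500):
    print(close_time, [value for _, value in batch])
print(buffering_delays(records, 500))
```

In this sketch a record that arrives just after an interval opens waits almost the full interval before any processing begins, which is the latency a per-record streaming model avoids.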

So why is it not effective to run a mini-batch every 1 ms? Simply because it would mean scheduling a distributed batch job every millisecond. Even though Spark is very fast at scheduling jobs, this would be too much, and it would also significantly reduce the achievable throughput. The batching techniques used in operating systems or TCP likewise do not work well if their batches become too small.
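
To make the throughput argument concrete, here is a toy cost model in Python. It assumes that each batch job pays a fixed scheduling overhead plus a per-record cost and must finish within one batch interval. The function name and the numbers (10 ms overhead, 0.01 ms per record) are hypothetical and are not measurements of Spark; they only show how the trend arises.

```python
def max_sustainable_rate(interval_ms, overhead_ms, per_record_ms):
    """Highest input rate (records per ms) a micro-batch job can keep up with.

    Toy model: processing one batch costs a fixed scheduling overhead plus
    a per-record cost, and must finish within one batch interval:
        overhead_ms + per_record_ms * rate * interval_ms <= interval_ms
    """
    if overhead_ms >= interval_ms:
        return 0.0  # scheduling alone already takes longer than the interval
    return (interval_ms - overhead_ms) / (per_record_ms * interval_ms)

# Hypothetical numbers, chosen only to illustrate the trend.
OVERHEAD_MS = 10.0
PER_RECORD_MS = 0.01

for interval_ms in (1000.0, 100.0, 20.0, 1.0):
    rate = max_sustainable_rate(interval_ms, OVERHEAD_MS, PER_RECORD_MS)
    print(f"interval={interval_ms:7.1f} ms  max rate={rate:6.1f} records/ms")
```

Under these assumptions the sustainable rate shrinks as the interval approaches the fixed overhead, and drops to zero once the interval is shorter than the overhead, because the system then spends all of its time scheduling jobs.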
