What is the difference between mini-batch and real-time streaming in practice (not theory)?

Question

What is the difference between mini-batch and real-time streaming in practice (not theory)? In theory, I understand that mini-batch processing batches records over a given time frame, whereas real-time streaming acts on each record as it arrives. My biggest question is: why not use mini-batches with an epsilon time frame (say, one millisecond)? In other words, what makes one approach more effective than the other?

I recently came across an example where mini-batch processing (Apache Spark) is used for fraud detection and real-time streaming (Apache Flink) is used for fraud prevention. Someone also commented that mini-batches would not be an effective solution for fraud prevention, since the goal is to stop a transaction while it is happening. Why would mini-batch processing (Spark) not be effective here? Why is it not effective to run mini-batches with 1 millisecond latency? Batching is used everywhere, including the OS and the kernel TCP/IP stack, where data headed to disk or the network is buffered. So what is the convincing argument that one is more effective than the other?

Recommended answer

Disclaimer: I'm a committer and PMC member of Apache Flink. I'm familiar with the overall design of Spark Streaming but do not know its internals in detail.

The mini-batch stream processing model as implemented by Spark Streaming works as follows:

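  • The records of a stream are collected in a buffer (the mini-batch).
  • Periodically, the collected records are processed using a regular Spark job. This means that, for each mini-batch, a complete distributed batch job is scheduled and executed.
  • While the job runs, the records for the next batch are collected.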
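
To make the three steps concrete, here is a minimal single-process sketch in plain Python. It is not Spark code and does not use any Spark API; the names `mini_batches` and `run_batch_job`, the sample events, and the 1-second interval are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

Record = Tuple[float, str]  # (arrival time in seconds, payload)


def mini_batches(records: Iterable[Record], interval: float) -> List[List[str]]:
    """Group records into consecutive mini-batches of `interval` seconds."""
    batches: List[List[str]] = []
    current: List[str] = []
    window_end = interval
    for arrival, payload in sorted(records):
        # Close every window that ended before this record arrived.
        # A window with no records still yields an (empty) batch.
        while arrival >= window_end:
            batches.append(current)
            current = []
            window_end += interval
        current.append(payload)
    batches.append(current)  # flush the last, still-open buffer
    return batches


def run_batch_job(batch: List[str]) -> int:
    """Stand-in for one scheduled batch job; here it only counts records."""
    return len(batch)


# Hypothetical input: four records arriving over three seconds.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d")]
batches = mini_batches(events, interval=1.0)
results = [run_batch_job(b) for b in batches]
```

Note that record "a" arrives at 0.1 s but is only processed once its window closes at 1.0 s. That waiting time is the latency a mini-batch model adds on top of the job itself.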

So, why is it not effective to run a mini-batch every 1 ms? Simply because it would mean scheduling a distributed batch job every millisecond. Even though Spark is very fast at scheduling jobs, this would be too much, and it would also significantly reduce the achievable throughput. The batching techniques used in operating systems or TCP also do not work well if their batches become too small.
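
The throughput argument can be illustrated with a toy cost model. It assumes every mini-batch pays a fixed scheduling overhead before any record is processed; the function name `max_throughput` and the numbers used (5 ms of overhead per job, 10 µs per record) are hypothetical and are not measurements of Spark.

```python
def max_throughput(interval_s: float, overhead_s: float, per_record_s: float) -> float:
    """Records per second a mini-batch pipeline can sustain in this toy model.

    Each interval loses `overhead_s` to scheduling the job; only the
    remaining time is available for processing records.
    """
    useful = interval_s - overhead_s
    if useful <= 0:
        # Scheduling alone takes at least as long as the interval,
        # so the pipeline can never keep up.
        return 0.0
    return useful / per_record_s / interval_s


OVERHEAD_S = 0.005     # assumed 5 ms scheduling cost per batch job
PER_RECORD_S = 0.00001  # assumed 10 microseconds of work per record

for interval in (1.0, 0.1, 0.01, 0.001):
    rate = max_throughput(interval, OVERHEAD_S, PER_RECORD_S)
    print(f"interval={interval:>6} s  sustainable rate={rate:,.0f} records/s")
```

Under these assumptions the sustainable rate shrinks as the interval approaches the scheduling overhead and drops to zero at a 1 ms interval. The same shape holds for any fixed per-batch cost, which is why very small batches also hurt in OS and TCP buffering.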
