Apache Flink vs Apache Spark as platforms for large-scale machine learning?

Question

Could anyone compare Flink and Spark as platforms for machine learning? Which is potentially better for iterative algorithms? Link to the general Flink vs Spark discussion: What is the difference between Apache Spark and Apache Flink?

Answer

Disclaimer: I'm a PMC member of Apache Flink. My answer focuses on the differences in how Flink and Spark execute iterations.

Apache Spark executes iterations by loop unrolling. This means that for each iteration a new set of tasks/operators is scheduled and executed. Spark does this very efficiently because it is very good at low-latency task scheduling (the same mechanism is used for Spark Streaming, by the way) and caches data in memory across iterations. Therefore, each iteration operates on the result of the previous iteration, which is held in memory. In Spark, iterations are implemented as regular for-loops (see the Logistic Regression example).
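To illustrate the loop-unrolling pattern, here is a minimal, hypothetical Scala sketch (a simplified gradient-descent loop, not the Logistic Regression example linked above): the input is cached once, and each pass of the driver-side for-loop schedules a fresh set of tasks against it.

```scala
import org.apache.spark.sql.SparkSession

object SparkIterationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("iteration-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Cache the data once so every iteration reads it from memory.
    val points = sc.parallelize(Seq((1.0, 1.0), (2.0, 3.0), (3.0, 2.0))).cache()
    val n = points.count()

    var w = 0.0 // model parameter kept in the driver
    for (_ <- 1 to 10) {
      // Each pass of this driver-side loop schedules a new set of tasks
      // (loop unrolling); the cached 'points' RDD is reused every time.
      val gradient = points.map { case (x, y) => (w * x - y) * x }.sum() / n
      w -= 0.1 * gradient
    }
    println(s"final w = $w")
    spark.stop()
  }
}
```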

Flink executes programs with iterations as cyclic data flows. This means that a data flow program (and all of its operators) is scheduled just once and the data is fed back from the tail of an iteration to its head. Basically, data flows in cycles around the operators within an iteration. Since operators are scheduled only once, they can maintain state across all iterations. Flink's API offers two dedicated iteration operators to specify iterations: 1) bulk iterations, which are conceptually similar to loop unrolling, and 2) delta iterations. Delta iterations can significantly speed up certain algorithms because the work per iteration shrinks as the iterations progress. For example, the 10th iteration of a delta-iteration PageRank implementation completes much faster than the first iteration.
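For comparison, here is a minimal sketch of a Flink bulk iteration using the Scala DataSet API (modeled on the pi-estimation example from the Flink documentation): the step function is planned once as part of a single cyclic dataflow, and its result is fed back to the head of the iteration for the next round.

```scala
import org.apache.flink.api.scala._

object FlinkBulkIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Estimate pi with a bulk iteration: the loop body below is scheduled
    // once, and its output is fed back as the input of the next iteration.
    val initial = env.fromElements(0)
    val count = initial.iterate(10000) { iterationInput =>
      iterationInput.map { i =>
        val x = Math.random()
        val y = Math.random()
        i + (if (x * x + y * y < 1) 1 else 0)
      }
    }

    count.map(c => c / 10000.0 * 4).print()
  }
}
```

Delta iterations run on the same cyclic-dataflow machinery but additionally maintain a work set, so each round only processes the elements that changed in the previous round; that is why later PageRank iterations become much cheaper.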
