Apache Flink vs Apache Spark as platforms for large-scale machine learning?

Question

Could anyone compare Flink and Spark as platforms for machine learning? Which is potentially better for iterative algorithms? Link to the general Flink vs Spark discussion: What is the differences between Apache Spark and Apache Flink? (http://stackoverflow.com/questions/28082581/what-is-the-differences-between-apache-spark-and-apache-flink/)

Recommended answer

Disclaimer: I'm a PMC member of Apache Flink. My answer focuses on the differences between how Flink and Spark execute iterations.

Apache Spark executes iterations by loop unrolling. This means that for each iteration a new set of tasks/operators is scheduled and executed. Spark does this very efficiently because it is very good at low-latency task scheduling (the same mechanism is used for Spark Streaming, by the way) and because it caches data in memory across iterations. Each iteration therefore operates on the result of the previous iteration, which is held in memory. In Spark, iterations are implemented as regular for-loops (see the Logistic Regression example).
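
To make the driver-loop pattern concrete, here is a minimal plain-Python sketch. It does not use Spark at all: the `cached` list only stands in for a dataset cached in memory, and the built-in `map` plus a column-wise sum stand in for a distributed map/reduce job. The function names, the step size, and the iteration count are assumptions chosen for illustration, not taken from Spark's own example.

```python
import math


def gradient(w, point):
    """Logistic-loss gradient for one labeled point; labels are -1 or +1."""
    x, y = point
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
    return [scale * xi for xi in x]


def train_logistic_regression(points, iterations=50, step=0.5):
    # Stand-in for caching the dataset in memory once, up front.
    cached = list(points)
    w = [0.0] * len(cached[0][0])

    # A regular for-loop on the "driver": every pass builds and runs a
    # fresh map + reduce over the cached data (the unrolled loop).
    for _ in range(iterations):
        grads = map(lambda p: gradient(w, p), cached)
        total = [sum(column) for column in zip(*grads)]
        w = [wi - step * gi for wi, gi in zip(w, total)]
    return w


if __name__ == "__main__":
    data = [([1.0, 2.0], 1), ([1.0, -2.0], -1),
            ([1.0, 3.0], 1), ([1.0, -1.0], -1)]
    print(train_logistic_regression(data))
```

The point of the sketch is the shape of the loop: the only state carried between passes is `w` and the cached data, and each pass is planned and executed from scratch.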

Flink executes programs with iterations as cyclic data flows. This means that a data flow program (and all its operators) is scheduled just once, and the data is fed back from the tail of an iteration to its head. Basically, data flows in cycles around the operators within an iteration. Since operators are scheduled only once, they can maintain state across all iterations. Flink's API offers two dedicated iteration operators (http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#iteration-operators) to specify iterations: 1) bulk iterations, which are conceptually similar to loop unrolling, and 2) delta iterations (http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html#delta-iterate-operator). Delta iterations can significantly speed up certain algorithms because the work done in each iteration decreases as the iterations proceed. For example, the 10th iteration of a delta-iteration PageRank implementation completes much faster than the first iteration.
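
The shrinking-work effect of a delta iteration can be illustrated with a small plain-Python simulation. This is not Flink's API, and it uses connected components by minimum-label propagation instead of PageRank because it is shorter to write down. The names `solution`, `workset`, and `workset_sizes` are hypothetical and only echo the solution-set/workset idea: state is kept across rounds, and each round touches only the elements that changed in the previous one.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # State kept across all rounds: vertex -> current component label.
    solution = {v: v for v in vertices}
    # Only vertices whose label changed in the previous round do work.
    workset = dict(solution)
    workset_sizes = []

    while workset:
        workset_sizes.append(len(workset))
        candidates = {}
        for v, label in workset.items():
            for n in neighbors[v]:
                # Keep the smallest label offered to n that improves on it.
                if label < candidates.get(n, solution[n]):
                    candidates[n] = label
        solution.update(candidates)   # merge the delta into the state
        workset = candidates          # next round processes only the changes
    return solution, workset_sizes


if __name__ == "__main__":
    labels, sizes = connected_components(
        [1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (3, 4), (5, 6)])
    print(labels)
    print(sizes)
```

On this toy graph the recorded workset sizes are 6, 4, 2, 1, so later rounds process far fewer elements than the first, which is the effect the answer describes for delta iterations.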
