Difference between reduce and foldLeft/fold in functional programming (particularly Scala and Scala APIs)?
Question
Why do Scala, and frameworks like Spark and Scalding, have both reduce and foldLeft? So then, what's the difference between reduce and fold?
Answer
reduce vs foldLeft
A big, big difference, not clearly mentioned in any other Stack Overflow answer on this topic, is that reduce should be given a commutative monoid, i.e. an operation that is both commutative and associative. This means the operation can be parallelized.
This distinction is very important for Big Data / MPP / distributed computing, and is the entire reason reduce even exists. The collection can be chopped up so that reduce can operate on each chunk, and then reduce can operate on the results of the chunks. In fact, the chunking need not stop at one level deep: we could chop up each chunk too. This is why summing the integers in a list is O(log N) given an infinite number of CPUs.
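The chunked evaluation described above can be sketched in plain Scala. This is only a minimal illustration of the shape of a parallel reduce, not how Spark actually implements it; chunkedReduce and the chunk size are made up for the example:

```scala
// Recursively reduce a list by chopping it into chunks, reducing each
// chunk, then reducing the list of chunk results. With one CPU per chunk,
// each level of recursion could run in parallel, giving O(log N) depth
// for a commutative and associative op.
def chunkedReduce[A](xs: List[A], chunkSize: Int)(op: (A, A) => A): A =
  if (xs.lengthCompare(chunkSize) <= 0) xs.reduce(op)
  else chunkedReduce(xs.grouped(chunkSize).map(_.reduce(op)).toList, chunkSize)(op)

// Summing integers: + is commutative and associative, so any chunking works.
val total = chunkedReduce((1 to 100).toList, 10)(_ + _)  // 5050
```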
If you just look at the signatures, there is no reason for reduce to exist, because you can achieve everything you can with reduce using a foldLeft. The functionality of foldLeft is greater than the functionality of reduce.
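To see this concretely, here is a minimal sketch of reduce written in terms of foldLeft, using an Option as the accumulator (reduceViaFoldLeft is a made-up name for illustration):

```scala
// reduce expressed with foldLeft: the accumulator is None until the first
// element is seen, then Some(acc) thereafter. Note this version is
// inherently sequential, unlike a true parallel reduce.
def reduceViaFoldLeft[A](xs: Seq[A])(op: (A, A) => A): A =
  xs.foldLeft(Option.empty[A]) {
    case (None, x)      => Some(x)
    case (Some(acc), x) => Some(op(acc, x))
  }.getOrElse(throw new UnsupportedOperationException("empty.reduce"))

reduceViaFoldLeft(List(1, 2, 3, 4))(_ + _)  // 10
```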
But you cannot parallelize a foldLeft, so its runtime is always O(N) (even if you feed in a commutative monoid). This is because it is assumed that the operation is not a commutative monoid, and so the accumulated value will be computed by a series of sequential aggregations.
foldLeft assumes neither commutativity nor associativity. It is associativity that gives the ability to chop up the collection, and it is commutativity that makes accumulating easy, because order is not important (so it doesn't matter in which order we aggregate the results from each of the chunks). Strictly speaking, commutativity is not necessary for parallelization (consider, for example, distributed sorting algorithms); it just makes the logic easier because you don't need to give your chunks an ordering.
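foldLeft's extra generality shows in its signature: its operation has type (B, A) => B, so it need not even be a binary operation on a single type, let alone associative or commutative. A small illustration:

```scala
val xs = List(1, 2, 3)

// The accumulator type (String) differs from the element type (Int);
// such an operation cannot be reordered or chunked, so evaluation must
// be strictly left-to-right.
val digits = xs.foldLeft("")((acc, n) => acc + n)  // "123"

// Even on a single type, subtraction is non-associative: foldLeft still
// has a well-defined meaning, whereas a parallel reduce would not.
val diff = xs.foldLeft(0)(_ - _)  // ((0 - 1) - 2) - 3 = -6
```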
If you have a look at the Spark documentation for reduce, it specifically says "... commutative and associative binary operator":
http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
Here is a demonstration that reduce can be faster than foldLeft (timeMany is a simple timing helper, not shown):
scala> import scala.collection.parallel.immutable.ParSeq
scala> val intParList: ParSeq[Int] = (1 to 100000).map(_ => scala.util.Random.nextInt()).par
scala> timeMany(1000, intParList.reduce(_ + _))
Took 462.395867 milli seconds
scala> timeMany(1000, intParList.foldLeft(0)(_ + _))
Took 2589.363031 milli seconds
reduce vs fold
Now this is where it gets a little closer to the FP Category Theory roots, and a little trickier to explain.
There is no fold method in Scalding, because under the (strict) MapReduce programming model we cannot define fold: chunks do not have an ordering, and fold requires only associativity, not commutativity. Spark does have fold, because its framework is a superset of the MapReduce programming model and can order its chunks. Well, you can actually do this in Hadoop too, but Scalding doesn't seem to expose this functionality in the version I'm familiar with.
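Spark's RDD.fold(zero)(op) applies the zero value once per partition and once more when combining the partition results, which is why the zero must be the operation's identity. Those semantics can be mimicked locally (a sketch with chunks standing in for partitions, not actual Spark code; sparkStyleFold is a made-up name):

```scala
// Mimic Spark's per-partition fold: fold each chunk from the zero value,
// then fold the chunk results from the zero value again. Because the
// chunks are ordered, op only needs to be associative, with `zero`
// as its identity.
def sparkStyleFold[A](chunks: List[List[A]], zero: A)(op: (A, A) => A): A =
  chunks.map(_.foldLeft(zero)(op)).foldLeft(zero)(op)

val chunks = List(List(1, 2), List(3, 4), List(5))
sparkStyleFold(chunks, 0)(_ + _)  // 15
// A non-identity zero is applied once per chunk plus once at the end:
sparkStyleFold(chunks, 1)(_ + _)  // 15 + 4 = 19
```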
Put simply, reduce works without an order of cumulation; fold requires an order of cumulation, and it is that order of cumulation that necessitates a zero value, not the mere existence of a zero value, that distinguishes them. Strictly speaking, reduce should work on an empty collection, because its zero value could be deduced by taking an arbitrary value x and then solving x op y = x, but that doesn't work with a non-commutative operation, as there can exist left and right zero values that are distinct (i.e. x op y != y op x). Of course, Scala doesn't bother to work out what this zero value is, as that would require doing some mathematics (which is probably uncomputable), so it just throws an exception.
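This last point is easy to check in plain Scala: on an empty collection, reduce has no value to return and throws, while fold simply returns its zero:

```scala
import scala.util.Try

val empty = List.empty[Int]

// reduce cannot conjure a zero value, so it throws
// UnsupportedOperationException on an empty collection.
val r = Try(empty.reduce(_ + _))  // Failure(java.lang.UnsupportedOperationException)

// fold is given the zero explicitly, so the empty case is well-defined.
val f = empty.fold(0)(_ + _)      // 0
```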
This difference between reduce* and fold* is an FP historical convention and has its roots in Category Theory. I hope that now this deep difference relating to parallelization will no longer go unnoticed :)