Difference between reduce and foldLeft/fold in functional programming (particularly Scala and Scala APIs)?


Question

Why do Scala, and frameworks like Spark and Scalding, have both reduce and foldLeft? So then, what's the difference between reduce and fold?

Answer

reduce vs foldLeft

A big, big difference, not mentioned clearly in any other Stack Overflow answer on this topic, is that reduce should be given a commutative monoid, i.e. an operation that is both commutative and associative. This means the operation can be parallelized.

This distinction is very important for Big Data / MPP / distributed computing, and is the entire reason reduce even exists. The collection can be chopped up into chunks, reduce can operate on each chunk, and then reduce can operate on the results of each chunk. In fact, the chunking need not stop at one level: we could chop up each chunk too. This is why summing the integers in a list is O(log N) given an infinite number of CPUs.
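The chunk-and-combine idea behind a parallelizable reduce can be sketched with plain Scala collections (the chunk size and the manual grouped step are just for illustration):

```scala
// Sum with an operation that is both associative and commutative,
// so it is safe to chunk: each chunk is reduced independently and
// the partial results are reduced again. With enough CPUs this
// tree of combines has O(log N) depth.
val xs = (1 to 1000).toList

// Sequential reduce: a single pass.
val seqSum = xs.reduce(_ + _)

// Manual "chunked" reduce: reduce each chunk, then reduce the partials.
val partials = xs.grouped(100).map(_.reduce(_ + _)).toList
val chunkedSum = partials.reduce(_ + _)

// Because + is associative and commutative, both agree: 500500.
```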

If you just look at the signatures there is no reason for reduce to exist, because you can achieve everything you can with reduce using a foldLeft. The functionality of foldLeft is strictly greater than the functionality of reduce.
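To see that foldLeft subsumes reduce at the signature level, here is a sketch implementing reduce in terms of foldLeft (the name reduceViaFoldLeft is made up for illustration):

```scala
// reduce expressed via foldLeft: seed the fold with the first element
// and fold the operation over the rest. Note foldLeft can also change
// the result type, which reduce cannot.
def reduceViaFoldLeft[A](xs: Seq[A])(op: (A, A) => A): A = {
  require(xs.nonEmpty, "reduce of empty collection")
  xs.tail.foldLeft(xs.head)(op)
}
```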

But you cannot parallelize a foldLeft, so its runtime is always O(N) (even if you feed in a commutative monoid). This is because it's assumed the operation is not a commutative monoid and so the cumulated value will be computed by a series of sequential aggregations.
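A small sketch of what sequential cumulation means: with a non-associative operation like subtraction, foldLeft fixes one left-to-right order starting from the zero value:

```scala
// foldLeft always cumulates strictly left to right from the zero:
// ((0 - 1) - 2) - 3 = -6. No other bracketing is ever used, which
// is exactly why it cannot be chunked and parallelized.
val result = List(1, 2, 3).foldLeft(0)(_ - _)
```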

foldLeft assumes neither commutativity nor associativity. It's associativity that gives the ability to chop up the collection, and it's commutativity that makes cumulating easy, because order is not important (so it doesn't matter in which order the results from the chunks are aggregated). Strictly speaking, commutativity is not necessary for parallelization (consider, for example, distributed sorting algorithms); it just makes the logic easier because you don't need to give your chunks an ordering.
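String concatenation illustrates the distinction: it is associative but not commutative, so chunked combining is safe only if the chunk ordering is preserved. A minimal sketch:

```scala
// Associative: ("a" + "b") + "c" == "a" + ("b" + "c"), so reducing
// ordered chunks and then combining them in order gives the same answer.
val words = List("a", "b", "c", "d")
val whole = words.reduce(_ + _) // "abcd"
val chunked = words.grouped(2).map(_.reduce(_ + _)).reduce(_ + _) // "ab" + "cd"

// Not commutative: combining the chunks in the wrong order changes the result.
val wrongOrder = "cd" + "ab" // "cdab", not "abcd"
```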

If you have a look at the Spark documentation for reduce, it specifically says "... commutative and associative binary operator":

http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD

Here is a demonstration of reduce vs foldLeft:

scala> import scala.collection.parallel.immutable.ParSeq

scala> val intParList: ParSeq[Int] = (1 to 100000).map(_ => scala.util.Random.nextInt()).par

scala> timeMany(1000, intParList.reduce(_ + _))
Took 462.395867 milli seconds

scala> timeMany(1000, intParList.foldLeft(0)(_ + _))
Took 2589.363031 milli seconds
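timeMany is not a standard library method and its definition isn't shown above; a plausible sketch (the name and behavior are assumptions) that evaluates a block a given number of times and prints the total elapsed time:

```scala
// Hypothetical helper matching the REPL usage above: evaluate the
// by-name argument `f` `times` times and report total wall-clock time.
def timeMany[A](times: Int, f: => A): Unit = {
  val start = System.nanoTime()
  (1 to times).foreach(_ => f)
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"Took $elapsedMs%.6f milli seconds")
}
```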

reduce vs fold

Now this is where it gets a little closer to the FP Category Theory roots, and a little trickier to explain.

There is no fold method in Scalding because under the (strict) MapReduce programming model we cannot define fold: chunks do not have an ordering, and fold requires only associativity, not commutativity. Spark does have fold, because its framework is a super-set of the MapReduce programming model and can order its chunks. Well, you can actually do this in Hadoop too, but Scalding doesn't seem to expose this functionality in the version I'm familiar with.

Put simply, reduce works without an order of cumulation, while fold requires one, and it is that order of cumulation that necessitates a zero value; it is not the existence of the zero value that distinguishes them. Strictly speaking, reduce should work on an empty collection, because its zero value could be deduced by taking an arbitrary value x and solving x op y = x, but that doesn't work with a non-commutative operation, as there can exist left and right zero values that are distinct (i.e. x op y != y op x). Of course, Scala doesn't bother to work out what this zero value is, as that would require doing some mathematics (which is probably uncomputable), so it just throws an exception.
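This shows up directly in Scala: fold has a zero, so the empty case is well-defined, while reduce throws:

```scala
import scala.util.Try

val empty = List.empty[Int]

// fold falls back on its zero value for the empty collection.
val foldResult = empty.fold(0)(_ + _) // 0

// reduce has no zero to fall back on, so Scala throws
// UnsupportedOperationException rather than trying to deduce one.
val reduceFailed = Try(empty.reduce(_ + _)).isFailure // true
```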

This difference between reduce* and fold* is an FP historical convention and has its roots in Category Theory. I hope that now this deep difference relating to parallelization will no longer go unnoticed :)
