Difference between reduce and foldLeft/fold in functional programming (particularly Scala and Scala APIs)?

Question

Why do Scala and frameworks like Spark and Scalding have both reduce and foldLeft? So then what's the difference between reduce and fold?

Answer

reduce vs foldLeft

A big big difference, not clearly mentioned in any other Stack Overflow answer on this topic, is that reduce should be given a commutative monoid, i.e. an operation that is both commutative and associative. This means the operation can be parallelized.

This distinction is very important for Big Data / MPP / distributed computing, and the entire reason why reduce even exists. The collection can be chopped up so that reduce can operate on each chunk, then reduce can operate on the results of the chunks - in fact the chunking need not stop one level deep; we can chop up each chunk too. This is why summing the integers in a list is O(log N) given an infinite number of CPUs.
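
To make the chunking idea concrete, here is a minimal sketch (my own illustration, not any framework's API) of a chunked reduce: each chunk is reduced independently (in a real system, on a different CPU or node), and the per-chunk results are then reduced again. With an associative and commutative operation like +, any chunking and any combination order give the same answer.

def chunkedReduce[A](xs: Seq[A], chunkSize: Int)(op: (A, A) => A): A =
  xs.grouped(chunkSize)              // chop the collection into chunks
    .map(chunk => chunk.reduce(op))  // reduce each chunk (parallelizable)
    .reduce(op)                      // then reduce the per-chunk results

// chunkedReduce(1 to 16, 4)(_ + _) == (1 to 16).sum  // 136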

If you just look at the signatures there is no reason for reduce to exist, because everything you can achieve with reduce you can also achieve with a foldLeft. The functionality of foldLeft is greater than the functionality of reduce.
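
To illustrate that signature argument, here is a sketch of reduce written in terms of foldLeft (Scala's actual reduce is not implemented this way; this only shows that it could be), followed by something reduce cannot express, since reduce's operator must have type (A, A) => A:

def reduceViaFoldLeft[A](xs: Seq[A])(op: (A, A) => A): A =
  xs.tail.foldLeft(xs.head)(op)  // seed the accumulator with the first element
                                 // (like reduce, this fails on an empty collection)

// foldLeft's accumulator may have a different type from the elements:
val rendered: String = List(1, 2, 3).foldLeft("")((acc, n) => acc + n)  // "123"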

But you cannot parallelize a foldLeft, so its runtime is always O(N) (even if you feed in a commutative monoid). This is because it is assumed that the operation is not a commutative monoid, and so the accumulated value is computed by a series of sequential aggregations.

foldLeft assumes neither commutativity nor associativity. It's associativity that gives the ability to chop up the collection, and it's commutativity that makes cumulating easy, because order is not important (so it doesn't matter in which order we aggregate the results from the chunks). Strictly speaking, commutativity is not necessary for parallelization (consider distributed sorting algorithms, for example); it just makes the logic easier because you don't need to give your chunks an ordering.
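
For example, string concatenation is associative but not commutative, yet Scala's parallel collections can still reduce with it: chunks may be reduced concurrently, but because the operator is only required to be associative (not commutative), chunk results are combined respecting element order and the result is deterministic. (On Scala 2.13+, .par has moved to the separate scala-parallel-collections module; the import below assumes that version.)

import scala.collection.parallel.CollectionConverters._  // Scala 2.13+

val words = Vector("foo", "bar", "baz", "qux")
words.par.reduce(_ + _)  // always "foobarbazqux": associative, order preserved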

If you have a look at the Spark documentation for reduce, it specifically says "... commutative and associative binary operator":

http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
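
A minimal sketch of that contract using the RDD API (the SparkContext setup is boilerplate; local[*] just runs everything in-process):

import org.apache.spark.{SparkConf, SparkContext}

val sc  = new SparkContext(new SparkConf().setAppName("reduce-demo").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 1000, numSlices = 8)

rdd.reduce(_ + _)  // 500500: + is commutative and associative, so safe
rdd.reduce(_ - _)  // unreliable: - is neither, so the result can depend on
                   // partitioning and on the order in which tasks finish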

Here is proof that reduce is NOT just a special case of foldLeft

// Note: timeMany is the answer's own timing helper (it is not a library
// function and its definition is not shown); it evaluates the expression the
// given number of times and reports the total wall-clock time.

scala> import scala.collection.parallel.ParSeq

scala> val intParList: ParSeq[Int] = (1 to 100000).map(_ => scala.util.Random.nextInt()).par

scala> timeMany(1000, intParList.reduce(_ + _))
Took 462.395867 milli seconds

scala> timeMany(1000, intParList.foldLeft(0)(_ + _))
Took 2589.363031 milli seconds

reduce vs fold

Now this is where it gets a little closer to the FP / mathematical roots, and a little trickier to explain. Reduce is defined formally as part of the MapReduce paradigm, which deals with orderless collections (multisets); Fold is defined formally in terms of recursion (see catamorphism) and thus assumes a structure / sequence to the collection.
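
To see the recursive flavour, here is fold written as a catamorphism over List, mirroring the standard library's foldRight (written out explicitly here for illustration); the structure of the list dictates the order of combination:

def foldRight[A, B](xs: List[A])(z: B)(f: (A, B) => B): B = xs match {
  case Nil    => z                           // base case: the zero value
  case h :: t => f(h, foldRight(t)(z)(f))    // recurse along the sequence
}

// foldRight(List(1, 2, 3))(0)(_ + _) == 1 + (2 + (3 + 0)) == 6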

There is no fold method in Scalding because under the (strict) MapReduce programming model we cannot define fold: chunks do not have an ordering, and fold requires only associativity, not commutativity.

Put simply, reduce works without an order of cumulation; fold requires an order of cumulation, and it is that order of cumulation (which necessitates a zero value), not the mere existence of a zero value, that distinguishes them. Strictly speaking reduce should work on an empty collection, because its zero value can be deduced by taking an arbitrary value x and then solving x op y = x, but that doesn't work with a non-commutative operation, as there can exist left and right zero values that are distinct (i.e. x op y != y op x). Of course Scala doesn't bother to work out what this zero value is, as that would require doing some mathematics (which is probably uncomputable), so it just throws an exception.
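
For concreteness, the empty-collection behaviour in plain Scala:

List.empty[Int].foldLeft(0)(_ + _)   // 0: the zero value is supplied explicitly
List.empty[Int].reduceOption(_ + _)  // None: the safe variant
List.empty[Int].reduce(_ + _)        // throws java.lang.UnsupportedOperationException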

It seems (as is often the case in etymology) that this original mathematical meaning has been lost, since the only obvious difference in programming is the signature. The result is that reduce has become a synonym for fold, rather than preserving its original meaning from MapReduce. Now these terms are often used interchangeably and behave the same in most implementations (ignoring empty collections). The weirdness is exacerbated by peculiarities, like the one in Spark that we shall now address.

So Spark does have a fold, but the order in which sub-results (one for each partition) are combined is (at the time of writing) the same order in which tasks are completed - and thus non-deterministic. Thanks to @CafeFeed for pointing out that fold uses runJob; after reading through the code I realised that it is non-deterministic. Further confusion is created by Spark having a treeReduce but no treeFold.
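
A sketch of the consequence, reusing the sc from the earlier snippet (this reflects the Spark behaviour described above, at the time of writing): fold merges the zero value into each partition and then merges the per-partition results in task-completion order, so a non-commutative operator can give different answers on different runs.

val strings = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 4)

strings.fold("")(_ + _)
// might be "abcd" on one run and, e.g., "badc" on another, because the
// per-partition results are combined in whichever order tasks complete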

There is a difference between reduce and fold even when applied to non-empty sequences. The former is defined as part of the MapReduce programming paradigm on collections with arbitrary order (http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf), and one ought to assume operators are commutative in addition to being associative to give deterministic results. The latter is defined in terms of catamorphisms and requires that the collections have a notion of sequence (or are defined recursively, like linked lists), and thus does not require commutative operators.

In practice, due to the unmathematical nature of programming, reduce and fold tend to behave in the same way, either correctly (as in Scala) or incorrectly (as in Spark).

My opinion is that confusion would be avoided if use of the term fold were completely dropped in Spark. At least Spark does have a note in its documentation:

This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala.
