reduce() vs. fold() in Apache Spark
Question
What is the difference between reduce vs. fold with respect to their technical implementation?

I understand that they differ by their signature, as fold accepts an additional parameter (i.e. an initial value) which gets added to each partition output.
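To make the "added to each partition output" point concrete, here is a plain-Scala sketch (not Spark itself) that simulates partitions as nested lists; the zero value is applied once inside each partition and once more in the final merge:

```scala
object FoldZeroPerPartition {
  // Spark-style fold over simulated partitions: the zero value is used
  // inside every partition and once more when merging the results.
  def sparkStyleFold(partitions: List[List[Int]], zero: Int): Int =
    partitions.map(_.foldLeft(zero)(_ + _)).foldLeft(zero)(_ + _)

  def main(args: Array[String]): Unit = {
    // 1+2+...+6 = 21, plus the zero 10 applied (3 partitions + 1 merge) times:
    println(sparkStyleFold(List(List(1, 2), List(3, 4), List(5, 6)), 10)) // prints 61
  }
}
```

This is why fold's zero value must be neutral for the operation (e.g. 0 for addition): a non-neutral zero is applied numPartitions + 1 times.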
- Can someone tell about the use cases for these two actions?
- Which would perform better in which scenario, considering 0 is used for fold?
Thanks in advance.
Answer
There is no practical difference when it comes to performance whatsoever:
- The RDD.fold action is using fold on the partition Iterators, which is implemented using foldLeft.
- RDD.reduce is using reduceLeft on the partition Iterators.
Both methods keep a mutable accumulator and process partitions sequentially using simple loops, with foldLeft implemented like this:

foreach (x => result = op(result, x))

and reduceLeft like this:

for (x <- self) {
  if (first) {
    ...
  }
  else acc = op(acc, x)
}
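The elided branch above seeds the accumulator with the first element. A complete, runnable re-implementation of that loop (a plain-Scala sketch, not the actual standard-library source) looks like:

```scala
object ReduceLeftSketch {
  // Minimal re-implementation of the reduceLeft loop: a mutable
  // accumulator seeded with the first element, then updated in place.
  def reduceLeftSketch[A](self: Iterable[A])(op: (A, A) => A): A = {
    var acc: A = null.asInstanceOf[A]
    var first = true
    for (x <- self) {
      if (first) {
        acc = x
        first = false
      } else acc = op(acc, x)
    }
    if (first) throw new UnsupportedOperationException("empty.reduceLeft")
    acc
  }

  def main(args: Array[String]): Unit = {
    println(reduceLeftSketch(List(1, 2, 3, 4))(_ + _)) // prints 10
  }
}
```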
The practical difference between these methods in Spark is only related to their behavior on empty collections and the ability to use a mutable buffer (arguably it is related to performance). You'll find some discussion in Why is the fold action necessary in Spark?
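The empty-collection difference can be seen with plain Scala collections, which mirror the semantics Spark exposes at the RDD level:

```scala
import scala.util.Try

object EmptyCollectionBehavior {
  // reduce has no zero value, so it throws on an empty collection...
  def reduceEmptyFails: Boolean = Try(List.empty[Int].reduce(_ + _)).isFailure

  // ...while fold simply returns the supplied zero value.
  def foldEmpty: Int = List.empty[Int].fold(0)(_ + _)

  def main(args: Array[String]): Unit = {
    println(reduceEmptyFails) // prints true
    println(foldEmpty)        // prints 0
  }
}
```

Correspondingly, RDD.reduce throws on an empty RDD while RDD.fold returns the zero value.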
Moreover, there is no difference in the overall processing model:

- Each partition is processed sequentially by a single thread.
- Multiple partitions are processed in parallel using multiple executors / executor threads.
- The final merge is performed sequentially by a single thread on the driver.
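This model can be roughly sketched with plain Scala futures standing in for executor threads (an illustration only, not Spark's actual scheduling code):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ProcessingModelSketch {
  def run(partitions: List[List[Int]]): Int = {
    // Each partition is reduced sequentially, but different partitions
    // run in parallel (here on the global thread pool).
    val perPartition =
      Future.sequence(partitions.map(p => Future(p.reduceLeft(_ + _))))
    // The final merge runs sequentially on the calling ("driver") thread.
    Await.result(perPartition, Duration.Inf).reduceLeft(_ + _)
  }

  def main(args: Array[String]): Unit =
    println(run(List(List(1, 2), List(3, 4), List(5, 6)))) // prints 21
}
```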