reduce() vs. fold() in Apache Spark


Question

What is the difference between reduce and fold with respect to their technical implementation?

I understand that their signatures differ: fold accepts an additional parameter (the initial value), which gets added to each partition's output.

  • Can someone describe the use cases for these two actions?
  • Which would perform better, and in which scenario, assuming 0 is used as the initial value for fold?

Thanks in advance.

Recommended answer

There is no practical difference whatsoever when it comes to performance:

  • The RDD.fold action uses fold on the partition Iterators, which is implemented using foldLeft.
  • RDD.reduce uses reduceLeft on the partition Iterators.

Both methods keep a mutable accumulator and process each partition sequentially using simple loops, with foldLeft implemented like this:

foreach (x => result = op(result, x))

and reduceLeft like this:

for (x <- self) {
  if (first) {
    ...
  }
  else acc = op(acc, x)
}
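The two loops quoted above can be sketched as standalone functions over a plain Scala Iterator. This is only an illustration, not the library source: the object name PartitionLoops and both method names are hypothetical, and reduceLoop seeds the accumulator with it.next() rather than reproducing the elided first-element branch.

```scala
object PartitionLoops {
  // foldLeft-style loop: start from the zero value and keep
  // updating a mutable accumulator for every element.
  def foldLoop[A](it: Iterator[A], zero: A)(op: (A, A) => A): A = {
    var result = zero
    it.foreach(x => result = op(result, x))
    result
  }

  // reduceLeft-style loop: there is no zero value, so the first
  // element seeds the accumulator and an empty input is an error.
  def reduceLoop[A](it: Iterator[A])(op: (A, A) => A): A = {
    if (!it.hasNext) throw new UnsupportedOperationException("empty.reduceLeft")
    var acc = it.next()
    while (it.hasNext) acc = op(acc, it.next())
    acc
  }
}
```

Both functions make a single pass and do one op call per element (one fewer for reduceLoop), which is why neither is expected to be measurably faster than the other.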

The practical difference between these methods in Spark relates only to their behavior on empty collections and to the ability to use a mutable buffer (which is arguably related to performance). You'll find some discussion in Why is the fold action necessary in Spark?
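The empty-collection difference can be illustrated without a cluster. The sketch below models an RDD as a plain Seq of partitions; EmptyBehaviour, simulatedFold and simulatedReduce are hypothetical names, and the code only approximates what Spark does (as far as I can tell, the zero value is applied once per partition and once more in the final merge, and reduce on an empty RDD raises an error).

```scala
object EmptyBehaviour {
  // Hypothetical stand-in for an RDD: a sequence of partitions.
  type Partitions[A] = Seq[Seq[A]]

  // fold: every partition starts from `zero`, and the merge of the
  // partial results starts from `zero` as well.
  def simulatedFold[A](parts: Partitions[A], zero: A)(op: (A, A) => A): A =
    parts.map(_.foldLeft(zero)(op)).foldLeft(zero)(op)

  // reduce: empty partitions contribute nothing; if no partition
  // produced a value, there is nothing to return.
  def simulatedReduce[A](parts: Partitions[A])(op: (A, A) => A): A = {
    val partials = parts.flatMap(p => p.reduceLeftOption(op))
    partials.reduceLeftOption(op).getOrElse(
      throw new UnsupportedOperationException("empty collection"))
  }
}
```

With this model, simulatedFold over no data simply returns the zero value, while simulatedReduce throws. It also shows why the zero value should be neutral for op: a zero of 1 with addition over two partitions adds 3 to the total rather than 1.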

Moreover, there is no difference in the overall processing model:

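  • Each partition is processed sequentially using a single thread.
  • Partitions are processed in parallel using multiple executors / executor threads.
  • The final merge is performed sequentially using a single thread on the driver.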
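The three points above can be mimicked on a single machine with Scala Futures. This is a rough analogy under stated assumptions, not Spark code: ProcessingModel and parallelFold are hypothetical names, the global execution context stands in for executor threads, and the partial results are merged in partition order for simplicity, whereas Spark is likely to merge them as tasks finish.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ProcessingModel {
  def parallelFold[A](partitions: Seq[Seq[A]], zero: A)(op: (A, A) => A): A = {
    // One task per partition; inside a task the partition is
    // folded sequentially by a single thread.
    val tasks: Seq[Future[A]] =
      partitions.map(p => Future(p.foldLeft(zero)(op)))

    // Tasks run concurrently; wait for all partial results.
    val partials: Seq[A] = Await.result(Future.sequence(tasks), 30.seconds)

    // Final merge: sequential, on the calling ("driver") thread.
    partials.foldLeft(zero)(op)
  }
}
```

Replacing foldLeft(zero) with a reduce-style loop in both places would leave the structure unchanged, which is the sense in which the processing model is the same for both actions.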
