Explanation of the fold method of Spark RDD

Question

I am running Spark-1.4.0 pre-built for Hadoop-2.4 (in local mode) to calculate the sum of squares of a DoubleRDD. My Scala code looks like

sc.parallelize(Array(2., 3.)).fold(0.0)((p, v) => p+v*v)

It gives a surprising result: 97.0.

In contrast, the plain Scala version of fold

Array(2., 3.).fold(0.0)((p, v) => p+v*v)

gives the expected answer 13.0.
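
(For reference, a quick trace of the sequential evaluation:)

// On a plain Scala collection, fold runs sequentially with one accumulator:
// (0.0 + 2*2) = 4.0, then (4.0 + 3*3) = 13.0
Array(2., 3.).foldLeft(0.0)((p, v) => p + v * v)   // 13.0, same as fold here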

It seems quite likely that I have made some tricky mistake in the code due to a lack of understanding. I have read that the function used in RDD.fold() should be commutative, otherwise the result may depend on the partitioning, etc. So, for example, if I change the number of partitions to 1,

sc.parallelize(Array(2., 3.), 1).fold(0.0)((p, v) => p+v*v)

the code will give me 169.0 on my machine!

Can someone explain what exactly is happening here?

Answer

Aggregate the elements of each partition, and then the results for all the partitions, using a given associative and commutative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.

This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.

To illustrate what is going on, let's simulate it step by step:

val rdd = sc.parallelize(Array(2., 3.))

val byPartition = rdd.mapPartitions(
    iter => Array(iter.fold(0.0)((p, v) => (p +  v * v))).toIterator).collect()

It gives us something similar to Array[Double] = Array(0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 9.0), and

byPartition.reduce((p, v) => (p + v * v))

returns 97.0.
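
To see where 97.0 (and the 169.0 from the single-partition run) comes from, note that the driver combines the per-partition results using the same zero value and the same op, so each partition result gets squared once more. A minimal local walk-through of that combine step (the eight entries above suggest eight partitions on this machine; treat the exact count as an assumption):

// Driver-side combine: fold the per-partition results with the same
// zero value and the same op, so each partition result is squared again.
val partitionResults = Array(0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 9.0)
partitionResults.fold(0.0)((p, v) => p + v * v)   // 0 + 4*4 + 9*9 = 97.0

// With a single partition, the within-partition fold yields 13.0 and the
// driver-side combine then computes 0 + 13*13 = 169.0.
Array(13.0).fold(0.0)((p, v) => p + v * v)        // 169.0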

The important thing to note is that the results can differ from run to run, depending on the order in which the partitions are combined.
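
To get the intended 13.0, one sketch (not part of the original answer) is to square first and then combine with plain addition, or to use aggregate, whose separate seqOp and combOp mean the squaring happens exactly once per element:

// Square each element, then fold with plain addition, which is
// associative and commutative, so the partitioning does not matter.
sc.parallelize(Array(2., 3.)).map(x => x * x).fold(0.0)(_ + _)   // 13.0

// aggregate takes a seqOp for elements and a combOp for partition
// results, so partition results are simply summed.
sc.parallelize(Array(2., 3.)).aggregate(0.0)((acc, x) => acc + x * x, _ + _)   // 13.0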
