Explanation of fold method of Spark RDD
Problem description
I am running Spark-1.4.0 pre-built for Hadoop-2.4 (in local mode) to calculate the sum of squares of a DoubleRDD. My Scala code looks like
sc.parallelize(Array(2., 3.)).fold(0.0)((p, v) => p+v*v)
And it gave a surprising result of 97.0.
In contrast, the plain Scala version of fold,
Array(2., 3.).fold(0.0)((p, v) => p+v*v)
gives the expected answer of 13.0.
It seems quite likely that I have made some tricky mistake in the code due to a lack of understanding. I have read that the function used in RDD.fold()
should be commutative, otherwise the result may depend on the partitioning, etc. So, for example, if I change the number of partitions to 1,
sc.parallelize(Array(2., 3.), 1).fold(0.0)((p, v) => p+v*v)
the code gives me 169.0 on my machine!
Can someone explain what exactly is happening here?
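The key difference is that Spark's RDD.fold first folds each partition (seeded with zeroValue), and then applies the *same* function again to combine the per-partition results (also seeded with zeroValue), which is why the function must be both associative and work as its own combiner. The non-commutative-in-type op (p, v) => p + v*v squares the partition totals during the merge step. A plain-Scala sketch that mimics this two-phase behavior (rddFold is a hypothetical helper, not a Spark API) reproduces both surprising numbers:

```scala
object FoldDemo {
  // Mimic RDD.fold: fold each partition with zeroValue as the seed,
  // then fold the per-partition results with the SAME op and seed.
  def rddFold(partitions: Seq[Seq[Double]], zero: Double)
             (op: (Double, Double) => Double): Double =
    partitions.map(_.foldLeft(zero)(op)).foldLeft(zero)(op)

  def main(args: Array[String]): Unit = {
    val op = (p: Double, v: Double) => p + v * v

    // Two partitions, one element each:
    // per-partition: 0+2*2 = 4.0 and 0+3*3 = 9.0
    // merge step:    0+4*4 = 16.0, then 16+9*9 = 97.0
    println(rddFold(Seq(Seq(2.0), Seq(3.0)), 0.0)(op))  // 97.0

    // One partition:
    // per-partition: 0+2*2+3*3 = 13.0
    // merge step:    0+13*13 = 169.0
    println(rddFold(Seq(Seq(2.0, 3.0)), 0.0)(op))       // 169.0
  }
}
```

Under this model the expected 13.0 can only come out if the merge step merely adds the partition totals, which is what RDD.aggregate allows by taking a separate combine function.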