Apache Spark - How to zip multiple RDDs
Question
Let's say I have a bunch of RDDs, maybe RDD[Int], and I have a function that defines an operation on a sequence of ints and returns an int, like a fold: f: Seq[Int] => Int.
If I have a sequence of RDDs, Seq[RDD[Int]], how do I apply the function and return a single new RDD with the resulting value? I can't seem to find a zipPartitions method in Spark which accomplishes this.
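For concreteness, such an f could be any ordinary fold over the sequence; a sum is one simple illustration (this particular choice of f is hypothetical, not from the question):

```scala
// An example of the fold-like function described above: summing the ints.
val f: Seq[Int] => Int = _.sum

f(Seq(1, 2, 3))  // 6
```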
Answer
At some point the elements of the Seq[Int] need to be bound to the parameters of f. Whether this occurs beforehand by creating a collection ("materializing the lists") or by binding them one by one in a partial-function-application manner, at some point there needs to be a collection-like data structure that contains all of the elements. Certainly, once inside f, they all need to be in the same place.
Here is a slightly more functional-style implementation of Spiro's makeZip function:
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ListBuffer

def makeZip(xs: ListBuffer[RDD[Double]]): RDD[ListBuffer[Double]] = {
  // initialize with single-element buffers from the first RDD
  val init = xs(0).map { ListBuffer(_) }
  // fold in the remaining RDDs by zipping and appending to the mutable buffer
  xs.drop(1).foldLeft(init) {
    (rddS, rddXi) => rddS.zip(rddXi).map(sx => sx._1 += sx._2)
  }
}
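The fold-like f from the question can then be mapped over the zipped result (for RDD[Double] it would need the type Seq[Double] => Double). The same zip-and-fold pattern can be sketched on plain Scala collections, without a Spark cluster; the names makeZipLocal and the choice of f = _.sum here are illustrative, not from the original answer:

```scala
import scala.collection.mutable.ListBuffer

// Zip a list of equal-length sequences element-wise, mirroring makeZip above,
// then apply a fold-like f to each zipped group.
def makeZipLocal(xs: ListBuffer[Seq[Int]]): Seq[ListBuffer[Int]] = {
  // start with single-element buffers from the first sequence
  val init = xs(0).map { ListBuffer(_) }
  // fold in the remaining sequences by zipping and appending
  xs.drop(1).foldLeft(init) {
    (acc, xi) => acc.zip(xi).map(sx => sx._1 += sx._2)
  }
}

val f: Seq[Int] => Int = _.sum  // an example fold

val seqs = ListBuffer[Seq[Int]](Seq(1, 2, 3), Seq(10, 20, 30), Seq(100, 200, 300))
val result = makeZipLocal(seqs).map(f)  // Seq(111, 222, 333)
```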