Apache Spark - How to zip multiple RDDs


Question

Let's say I have a bunch of RDDs, maybe RDD[Int], and I have a function that defines an operation on a sequence of ints and returns an int, like a fold: f: Seq[Int] => Int.

If I have a sequence of RDDs, Seq[RDD[Int]], how do I apply the function and return a single new RDD with the resulting values? I can't seem to find a zipPartitions method in Spark that accomplishes this.
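In other words, the goal is something with a shape like this (a hypothetical signature, just to make the question concrete; applyAcross is not an existing Spark method):

import org.apache.spark.rdd.RDD

// Hypothetical: the i-th element of the result is f applied to the
// i-th elements of all the input RDDs (assumed to be equal-length).
def applyAcross(rdds: Seq[RDD[Int]], f: Seq[Int] => Int): RDD[Int] = ???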

Answer

At some point the elements of the Seq[Int] need to be bound to the parameters of f. Whether this happens beforehand by creating a collection ("materializing the lists") or by binding them one by one in partial-application style, at some point there needs to be a collection-like data structure that contains all of the elements. Certainly, once inside f, they all need to be in the same place.

Here is a slightly more functional-style implementation of Spiro's makeZip function:

import scala.collection.mutable.ListBuffer
import org.apache.spark.rdd.RDD

def makeZip(xs: ListBuffer[RDD[Double]]): RDD[ListBuffer[Double]] = {
  // initialize with one-element buffers built from the first RDD
  val init = xs(0).map { ListBuffer(_) }
  // fold in each remaining RDD by zipping and appending to the buffer;
  // += returns the buffer itself, so the element type is preserved
  xs.drop(1).foldLeft(init) {
    (rddS, rddXi) => rddS.zip(rddXi).map(sx => sx._1 += sx._2)
  }
}
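As a usage sketch (not part of the original answer), assuming a live SparkContext named sc, the zipped RDD can then be mapped with the fold f, here summation:

import scala.collection.mutable.ListBuffer
import org.apache.spark.rdd.RDD

// Three equal-length RDDs; zip requires matching partitioning, which
// parallelize gives us here since the inputs have equal sizes.
val a = sc.parallelize(Seq(1.0, 2.0, 3.0))
val b = sc.parallelize(Seq(10.0, 20.0, 30.0))
val c = sc.parallelize(Seq(100.0, 200.0, 300.0))

val zipped: RDD[ListBuffer[Double]] = makeZip(ListBuffer(a, b, c))

// Apply the fold f (here: summation) to each zipped row.
val summed: RDD[Double] = zipped.map(_.sum)
summed.collect()  // Array(111.0, 222.0, 333.0)

Note that zip demands that the RDDs have the same number of partitions and the same number of elements per partition, so makeZip only works when the inputs line up this way.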
