Apache Spark - How to zip multiple RDDs


Question

Let's say I have a bunch of RDDs, maybe RDD[Int], and I have a function that defines an operation on a sequence of ints and returns an int, like a fold: f: Seq[Int] => Int.

If I have a sequence of RDDs, Seq[RDD[Int]], how do I apply the function and return a single new RDD with the resulting values? I can't seem to find a zipPartitions method in Spark that accomplishes this.
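In other words, the goal is something with a shape like this (a hypothetical signature, just to make the question concrete; applyAcross is not an existing Spark method):

import org.apache.spark.rdd.RDD

// Hypothetical: the i-th element of the result is f applied to the
// i-th elements of all the input RDDs (assumed to be equal-length).
def applyAcross(rdds: Seq[RDD[Int]], f: Seq[Int] => Int): RDD[Int] = ???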

Answer

At some point the elements of the Seq[Int] need to be bound to the parameters of f. Whether this happens beforehand by creating a collection ("materializing the lists") or by binding them one by one in partial-application style, at some point there needs to be a collection-like data structure that contains all of the elements. Certainly, once inside f, they all need to be in the same place.

Here is a slightly more functional-style implementation of Spiro's makeZip function:

import scala.collection.mutable.ListBuffer
import org.apache.spark.rdd.RDD

def makeZip(xs: ListBuffer[RDD[Double]]): RDD[ListBuffer[Double]] = {
  // initialize with one-element buffers built from the first RDD
  val init = xs(0).map { ListBuffer(_) }
  // fold in each remaining RDD by zipping and appending to the buffer;
  // += returns the buffer itself, so the element type is preserved
  xs.drop(1).foldLeft(init) {
    (rddS, rddXi) => rddS.zip(rddXi).map(sx => sx._1 += sx._2)
  }
}
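As a usage sketch (not part of the original answer), assuming a live SparkContext named sc, the zipped RDD can then be mapped with the fold f, here summation:

import scala.collection.mutable.ListBuffer
import org.apache.spark.rdd.RDD

// Three equal-length RDDs; zip requires matching partitioning, which
// parallelize gives us here since the inputs have equal sizes.
val a = sc.parallelize(Seq(1.0, 2.0, 3.0))
val b = sc.parallelize(Seq(10.0, 20.0, 30.0))
val c = sc.parallelize(Seq(100.0, 200.0, 300.0))

val zipped: RDD[ListBuffer[Double]] = makeZip(ListBuffer(a, b, c))

// Apply the fold f (here: summation) to each zipped row.
val summed: RDD[Double] = zipped.map(_.sum)
summed.collect()  // Array(111.0, 222.0, 333.0)

Note that zip demands that the RDDs have the same number of partitions and the same number of elements per partition, so makeZip only works when the inputs line up this way.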
