如何在转星火的RDD [英] How to transpose an RDD in Spark

查看:159
本文介绍了如何在转星火的RDD的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个这样的RDD:

I have an RDD like this:

1 2 3
4 5 6
7 8 9

有一个矩阵。现在我想转的RDD是这样的:

It is a matrix. Now I want to transpose the RDD like this:

1 4 7
2 5 8
3 6 9

我怎样才能做到这一点?

How can I do this?

推荐答案

假设你有一个N&次;×M矩阵

Say you have an N×M matrix.

如果N和M都非常小,你可以按住N'倍;存储器M的项目,它并没有多大意义,使用RDD。但换位很容易:

If both N and M are so small that you can hold N×M items in memory, it doesn't make much sense to use an RDD. But transposing it is easy:

val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
val transposed = sc.parallelize(rdd.collect.toSeq.transpose)

如果N或M是如此之大,你不能在内存中保留N或者M的条目,那么你就不能有这种规模的RDD线。无论是原始或转置矩阵是不可能在此情况下,重新present

If N or M is so large that you cannot hold N or M entries in memory, then you cannot have an RDD line of this size. Either the original or the transposed matrix is impossible to represent in this case.

N和M可以是一个中等规模的:你可以在内存中保留N或者M条目,但你不能持有N'次,M项。在这种情况下,你必须炸毁矩阵,并再次把它在一起:

N and M may be of a medium size: you can hold N or M entries in memory, but you cannot hold N×M entries. In this case you have to blow up the matrix and put it together again:

val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
// Split the matrix into one number per line.
val byColumnAndRow = rdd.zipWithIndex.flatMap {
  case (row, rowIndex) => row.zipWithIndex.map {
    case (number, columnIndex) => columnIndex -> (rowIndex, number)
  }
}
// Build up the transposed matrix. Group and sort by column index first.
val byColumn = byColumnAndRow.groupByKey.sortByKey().values
// Then sort by row index.
val transposed = byColumn.map {
  indexedRow => indexedRow.toSeq.sortBy(_._1).map(_._2)
}

这篇关于如何在转星火的RDD的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆