Spark: flatten a Seq by reversing groupBy (i.e. repeat the header for each item in its sequence)
Question
We have an RDD with the following form:
org.apache.spark.rdd.RDD[((BigInt, String), Seq[(BigInt, Int)])]
What we would like to do is flatten that into a single list of tab-delimited strings to save with saveAsTextFile. By flatten, I mean repeat the groupBy tuple (BigInt, String) for each item in its Seq.
So data that looks like...
((x1,x2), ((y1.1,y1.2), (y2.1, y2.2) .... ))
... Will wind up looking like
x1 x2 y1.1 y1.2
x1 x2 y2.1 y2.2
So far the code I've tried mostly flattens it all to just one line, "x1, x2, y1.1, y1.2, y2.1, y2.2 ..." etc...
Any help would be appreciated, thanks in advance!
Recommended answer
If you want to flatten the results of a groupByKey() operation so that both the key and value columns are flattened into one tuple, I recommend using flatMap:
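import org.apache.spark.rdd.RDD

// Sample data in the grouped shape: one key with a list of values
val grouped = sc.parallelize(Seq(((1,"two"), List((3,4), (5,6)))))
// Emit one flat tuple per value, repeating the key fields each time
val flattened: RDD[(Int, String, Int, Int)] = grouped.flatMap { case (key, groupValues) =>
  groupValues.map { value => (key._1, key._2, value._1, value._2) }
}
// flattened.collect() is Array((1,two,3,4), (1,two,5,6))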
From here, you can use additional transformations and actions to convert your combined tuple into a tab-separated string and to save the output.
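As a minimal sketch of that last step, the helper below builds one tab-separated line per item in the grouped Seq, using the types from the question. The names `TsvFlatten`, `toTsvLines`, `rdd`, and `outputPath` are illustrative assumptions, not part of the original answer.

```scala
object TsvFlatten {
  // Hypothetical helper: one tab-separated line per item in the grouped Seq,
  // with the key fields repeated at the start of every line.
  def toTsvLines(key: (BigInt, String), values: Seq[(BigInt, Int)]): Seq[String] =
    values.map { case (a, b) => Seq(key._1, key._2, a, b).mkString("\t") }
}

// With the RDD from the question (rdd and outputPath are placeholder names):
// rdd.flatMap { case (key, values) => TsvFlatten.toTsvLines(key, values) }
//    .saveAsTextFile(outputPath)
```

Keeping the line-building logic in a plain function means it can be checked on local data before it is run inside a Spark job.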
If you don't care about the flattened RDD containing Tuples, then you can write the more general
val flattened: RDD[Array[Any]] = grouped.flatMap { case (key, groupValues) =>
  groupValues.map(value => (key.productIterator ++ value.productIterator).toArray)
}
// flattened.collect() is Array(Array(1, two, 3, 4), Array(1, two, 5, 6))
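The same `productIterator` idea can produce the tab-separated line directly, without the intermediate array. This is a sketch only; `ProductTsv` and `tsvLine` are hypothetical names introduced for illustration.

```scala
object ProductTsv {
  // Hypothetical helper: joins the fields of any two tuples (or case classes)
  // with tabs, key fields first.
  def tsvLine(key: Product, value: Product): String =
    (key.productIterator ++ value.productIterator).mkString("\t")
}

// Possible use inside the flatMap shown above:
// grouped.flatMap { case (key, groupValues) => groupValues.map(ProductTsv.tsvLine(key, _)) }
```

Because it accepts any `Product`, this version does not depend on the tuple arity, at the cost of losing the static field types.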
Also, check out the flatMapValues transformation; if you have an RDD[(K, Seq[V])] and want RDD[(K, V)], then you can do flatMapValues(identity).
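To make the effect of `flatMapValues(identity)` concrete, here is a sketch of the equivalent operation on a local Scala collection rather than an RDD. The names `FlatMapValuesSketch` and `ungroup` are assumptions for illustration.

```scala
object FlatMapValuesSketch {
  // Local-collection equivalent of rdd.flatMapValues(identity):
  // each (key, Seq(v1, v2, ...)) becomes (key, v1), (key, v2), ...
  def ungroup[K, V](grouped: Seq[(K, Seq[V])]): Seq[(K, V)] =
    grouped.flatMap { case (k, vs) => vs.map(v => (k, v)) }
}
```

Note that the result still holds nested pairs such as ((x1, x2), (y1, y2)), so a further map is needed to get the four flat columns the question asks for.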