Spark: flatten a Seq by reversing a groupBy (i.e. repeat the header for each sequence item in it)

Question

We have an RDD with the following form:

org.apache.spark.rdd.RDD[((BigInt, String), Seq[(BigInt, Int)])]

What we would like to do is flatten that into a single list of tab-delimited strings to save with saveAsTextFile. By flatten, I mean repeat the groupBy key tuple (BigInt, String) for each item in its Seq.

So data that looks like...

((x1,x2), ((y1.1,y1.2), (y2.1, y2.2) .... ))

... Will wind up looking like

x1   x2   y1.1  y1.2
x1   x2   y2.1  y2.2

So far the code I've tried mostly flattens it all to just one line, "x1, x2, y1.1, y1.2, y2.1, y2.2 ..." etc...

Any help would be appreciated, thanks in advance!

Recommended answer

If you want to flatten the results of a groupByKey() operation so that both the key and value columns are flattened into one tuple, I recommend using flatMap:

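import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext named sc (as in spark-shell).
val grouped = sc.parallelize(Seq(((1, "two"), List((3, 4), (5, 6)))))
// Emit one flat tuple per value, repeating the key's fields each time.
val flattened: RDD[(Int, String, Int, Int)] = grouped.flatMap { case (key, groupValues) =>
  groupValues.map { value => (key._1, key._2, value._1, value._2) }
}
// flattened.collect() is Array((1,two,3,4), (1,two,5,6))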

From here, you can use additional transformations and actions to convert your combined tuple into a tab-separated string and to save the output.
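
As a rough sketch of that last step, the formatting can live in a small pure function that takes one grouped record (using the types from the question) and returns one tab-separated line per Seq item. The helper name toTsvLines and the output path are hypothetical, and plain Scala collections stand in for the RDD so the snippet runs without a cluster.

```scala
// Hypothetical helper: one grouped record in, one tab-separated line per Seq item out.
def toTsvLines(record: ((BigInt, String), Seq[(BigInt, Int)])): Seq[String] = {
  val ((id, name), items) = record
  // Repeat the key fields (id, name) in front of every item.
  items.map { case (itemId, count) => Seq(id, name, itemId, count).mkString("\t") }
}

// Plain Scala collections stand in for the RDD here.
val sample = Seq(((BigInt(1), "two"), Seq((BigInt(3), 4), (BigInt(5), 6))))
val lines = sample.flatMap(toTsvLines)
lines.foreach(println)

// With Spark, the same function should plug into the RDD (output path is hypothetical):
// rdd.flatMap(toTsvLines).saveAsTextFile("/tmp/flattened-tsv")
```

Keeping the formatting in a plain function means it can be checked locally before it is passed to flatMap on the real RDD.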

If you don't care about the flattened RDD containing tuples, then you can write the more general version:

val flattened: RDD[Array[Any]] = grouped.flatMap { case (key, groupValues) =>
  groupValues.map(value => (key.productIterator ++ value.productIterator).toArray)
}
// flattened.collect() is Array(Array(1, two, 3, 4), Array(1, two, 5, 6))

Also, check out the flatMapValues transformation; if you have an RDD[(K, Seq[V])] and want an RDD[(K, V)], then you can do flatMapValues(identity).
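
To make the shape of that result concrete, here is a small sketch that mimics flatMapValues on plain Scala collections. The function flatMapValuesLocal is a hypothetical stand-in, not part of Spark; on a real pair RDD the call would be grouped.flatMapValues(identity). Note that the key and value stay nested as ((k1, k2), (v1, v2)), so one more map is still needed to get flat tab-separated rows.

```scala
// Plain-collection stand-in for flatMapValues: expand each value with f, keep the key.
def flatMapValuesLocal[K, V, U](pairs: Seq[(K, V)])(f: V => Iterable[U]): Seq[(K, U)] =
  pairs.flatMap { case (k, v) => f(v).map(u => (k, u)) }

val grouped = Seq(((1, "two"), List((3, 4), (5, 6))))

// Same idea as grouped.flatMapValues(identity) on a pair RDD.
val pairs = flatMapValuesLocal(grouped)(vs => vs)

// Key and value are still nested tuples, so one more map produces tab-separated rows.
val rows = pairs.map { case ((k1, k2), (v1, v2)) => s"$k1\t$k2\t$v1\t$v2" }
rows.foreach(println)
```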
