Spark RDD to CSV - Add empty columns


Question

I have an RDD[Map[String,Int]] where the keys of the maps are the column names. Each map is incomplete, so to know the column names I would need to union all the keys. Is there a way to avoid this collect operation for finding all the keys, and to call rdd.saveAsTextFile(..) just once to get the CSV?

For example, say I have an RDD with two elements (Scala notation):

Map("a"->1, "b"->2)
Map("b"->1, "c"->3)

I would like to end up with this CSV:

a,b,c
1,2,0
0,1,3

Scala solutions are preferred, but any other Spark-compatible language would do.

Edit:

I can also try to solve my problem from the other direction. Say I somehow know all the columns at the start, but I want to get rid of the columns that have a 0 value in every map. So the problem becomes: I know that the keys are ("a", "b", "c"), and from this:

Map("a"->1, "b"->2, "c"->0)
Map("a"->3, "b"->1, "c"->0)

I need to write the CSV:

a,b
1,2
3,1

Would it be possible to do this with only one collect?

Recommended answer

If your statement is "every new element in my RDD may add a new column name I have not seen so far", then the answer is that you obviously can't avoid a full scan. But you don't need to collect all the elements on the driver.

You could use aggregate to collect only the column names. This method takes a zero value and two functions: one inserts a single element into the resulting collection, and the other merges the results from two different partitions.

rdd.aggregate(Set.empty[String])( {(s, m) => s union m.keySet }, { (s1, s2) => s1 union s2 })
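
To see what the two functions do, here is a minimal sketch that imitates the call on plain Scala collections, so it runs without Spark. The object name AggregateSketch and the partitions value are made up for illustration; they stand in for how Spark might split the two example maps, and the real partitioning could differ.

```scala
object AggregateSketch {
  // Hypothetical data: two "partitions", each holding some of the RDD's maps.
  val partitions: Seq[Seq[Map[String, Int]]] = Seq(
    Seq(Map("a" -> 1, "b" -> 2)),
    Seq(Map("b" -> 1, "c" -> 3))
  )

  // First function: fold one element (a map) into the running set of column names.
  val seqOp: (Set[String], Map[String, Int]) => Set[String] =
    (s, m) => s union m.keySet

  // Second function: merge the sets computed for two partitions.
  val combOp: (Set[String], Set[String]) => Set[String] =
    (s1, s2) => s1 union s2

  // Conceptually what aggregate does: fold each partition starting from the
  // zero value, then merge the per-partition results.
  val columns: Set[String] =
    partitions
      .map(_.foldLeft(Set.empty[String])(seqOp))
      .foldLeft(Set.empty[String])(combOp)

  def main(args: Array[String]): Unit =
    println(columns.toSeq.sorted.mkString(",")) // prints a,b,c
}
```

Only the small set of column names reaches the driver; the maps themselves stay on the executors.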

You will get back a set of all the column names in the RDD. In a second scan you can then write the CSV file.
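
A possible shape for that second scan is sketched below in plain Scala, assuming the set of column names is already available on the driver. The names CsvRows, orderedColumns and toCsvLine are hypothetical helpers, not part of Spark, and filling missing keys with 0 follows the example in the question.

```scala
object CsvRows {
  // A Set has no guaranteed order, so fix the column order once by sorting.
  def orderedColumns(names: Set[String]): Seq[String] = names.toSeq.sorted

  // Build one CSV line for one map; keys missing from the map become 0.
  def toCsvLine(columns: Seq[String], row: Map[String, Int]): String =
    columns.map(c => row.getOrElse(c, 0)).mkString(",")

  def main(args: Array[String]): Unit = {
    val columns = orderedColumns(Set("c", "a", "b"))
    println(columns.mkString(","))                        // a,b,c
    println(toCsvLine(columns, Map("a" -> 1, "b" -> 2)))  // 1,2,0
    println(toCsvLine(columns, Map("b" -> 1, "c" -> 3)))  // 0,1,3
  }
}
```

With Spark, the second scan would then be roughly rdd.map(m => CsvRows.toCsvLine(columns, m)).saveAsTextFile(path), where path is a placeholder. saveAsTextFile writes a directory of part files rather than a single file, so the header line has to be added separately. For the variant in the question's edit (dropping columns that are 0 in every map), the same first scan should work if the insert function only adds keys with a nonzero value, for example s union m.filter(_._2 != 0).keySet.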
