Spark RDD to CSV - Add empty columns
Problem description
I have an RDD[Map[String,Int]] in which the keys of each map are column names. Each map is incomplete, so to know the full set of column names I would need to union all the keys. Is there a way to avoid a collect operation just to learn all the keys, and to call rdd.saveAsTextFile(..) only once to get the CSV?
For example, say I have an RDD with two elements (Scala notation):
Map("a"->1, "b"->2)
Map("b"->1, "c"->3)
I would like to end up with this CSV:
a,b,c
1,2,0
0,1,3
Scala solutions are preferred, but any other Spark-compatible language would do.
Edit:
I can also try to solve my problem from another direction. Let's say I somehow know all the columns at the beginning, but I want to get rid of the columns that have a 0 value in all maps. So the problem becomes: I know that the keys are ("a", "b", "c"), and from this:
Map("a"->1, "b"->2, "c"->0)
Map("a"->3, "b"->1, "c"->0)
I need to write the CSV:
a,b
1,2
3,1
Would it be possible to do this with only one collect?
Recommended answer
If your statement is "every new element in my RDD may add a new column name I have not seen so far", then you obviously can't avoid a full scan. But you don't need to collect all the elements on the driver.
You could use aggregate to collect only the column names. Besides the initial (zero) value, this method takes two functions: one inserts a single element into the resulting collection, and the other merges the results from two different partitions.
rdd.aggregate(Set.empty[String])( {(s, m) => s union m.keySet }, { (s1, s2) => s1 union s2 })
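To make the roles of the two functions concrete, here is a minimal sketch that imitates the aggregate call on plain Scala collections, so it runs without a cluster. The object name, the helper aggregateLocally and the nested Seq standing in for partitions are all assumptions made for illustration and are not part of Spark. The addNonZeroKeys variant is one possible way to approach the edit in the question (keep only columns that are non-zero somewhere), not something the answer itself states.

```scala
// Plain Scala collections stand in for Spark partitions here (an assumption
// made so the sketch runs locally; none of these names come from Spark).
object AggregateSketch {
  type Row = Map[String, Int]

  // First function: fold one element into the partial result of a partition.
  def addKeys(seen: Set[String], row: Row): Set[String] =
    seen union row.keySet

  // Variant for the question's edit: only keep keys whose value is non-zero.
  def addNonZeroKeys(seen: Set[String], row: Row): Set[String] =
    seen union row.filter { case (_, v) => v != 0 }.keySet

  // Second function: merge the partial results of two partitions.
  def mergeKeys(a: Set[String], b: Set[String]): Set[String] =
    a union b

  // Hypothetical helper mimicking rdd.aggregate(Set.empty[String])(seqOp, mergeKeys):
  // each inner Seq plays the role of one partition.
  def aggregateLocally(partitions: Seq[Seq[Row]])(
      seqOp: (Set[String], Row) => Set[String]): Set[String] =
    partitions
      .map(_.foldLeft(Set.empty[String])(seqOp)) // per-partition pass
      .foldLeft(Set.empty[String])(mergeKeys)    // merge partial results
}
```

With an actual RDD, the same functions could presumably be passed directly, for example rdd.aggregate(Set.empty[String])(AggregateSketch.addKeys, AggregateSketch.mergeKeys). Only the resulting set of names travels to the driver, not the rows themselves.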
You will get back the set of all column names in the RDD. In a second scan you can then write the CSV file.
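The second scan could look roughly like the sketch below, which turns each map into one CSV line and fills missing columns with 0, as in the question's example. The object and function names are hypothetical, and the sketch assumes column names contain no commas or quotes, so no CSV escaping is done.

```scala
// Minimal sketch of the per-row formatting used in the second pass.
// All names here are illustrative assumptions, not Spark API.
object CsvSketch {
  // A Set has no guaranteed order, so fix one column order up front.
  def orderedColumns(names: Set[String]): Seq[String] =
    names.toSeq.sorted

  // Header line listing the columns in the chosen order.
  def headerLine(columns: Seq[String]): String =
    columns.mkString(",")

  // One CSV line per map; columns absent from the map are written as 0.
  def csvLine(columns: Seq[String], row: Map[String, Int]): String =
    columns.map(c => row.getOrElse(c, 0)).mkString(",")
}
```

On the RDD this would be applied as something like rdd.map(row => CsvSketch.csvLine(columns, row)).saveAsTextFile(..), where columns is the ordered result of the first pass. Note that saveAsTextFile writes one part file per partition, so the header line has to be added separately.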