Add a header before text file on save in Spark


Question

I have some Spark code that processes a CSV file. It performs some transformations on it. I now want to save this RDD as a CSV file and add a header. Each line of this RDD is already formatted correctly.

I am not sure how to do it. I wanted to do a union of the header string and my RDD, but the header string is not an RDD, so it does not work.

Answer

You can make an RDD out of your header line and then union it, yes:

import org.apache.spark.rdd.RDD

val rdd: RDD[String] = ...  // your already-formatted CSV lines
// Wrap the header line in a single-element RDD so it can be unioned.
val header: RDD[String] = sc.parallelize(Array("my,header,row"))
header.union(rdd).saveAsTextFile(...)

Then you end up with a bunch of part-xxxxx files that you merge.

The problem is that I don't think you're guaranteed that the header will be the first partition, and therefore end up in part-00000 and at the top of your file. In practice, though, I'm pretty sure it will.
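If you want the header's position to be deterministic rather than just likely, one alternative (a sketch, not from the original answer; the header string and output path are placeholders) is to prepend the header inside partition 0 using the standard RDD method mapPartitionsWithIndex:

```scala
import org.apache.spark.rdd.RDD

// Sketch: inject the header at the front of partition 0, so it is written
// first regardless of how partitions map to part-xxxxx files.
def withHeader(rdd: RDD[String], headerLine: String): RDD[String] =
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) Iterator(headerLine) ++ iter else iter
  }

// Usage (paths and header are illustrative):
// withHeader(rdd, "my,header,row").saveAsTextFile("hdfs:///output/csv")
```

Because saveAsTextFile writes one part file per partition in index order, the header lands at the top of part-00000.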

More reliable would be to use Hadoop commands like hdfs to merge the part-xxxxx files, and, as part of that command, just throw in the header line from a file.
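The merge step above might look like the following sketch (paths and the header line are assumptions; it requires a Hadoop installation, and hdfs dfs -getmerge concatenates the part files in name order):

```shell
# Write the header to the local result file first.
echo "my,header,row" > result.csv

# Merge all part-xxxxx files from the Spark output directory into one local file.
hdfs dfs -getmerge /output/csv parts.tmp

# Append the merged data after the header.
cat parts.tmp >> result.csv
rm parts.tmp
```

This sidesteps the partition-ordering question entirely, since the header never passes through Spark.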

