How to write to CSV in Spark


Problem description


I'm trying to find an effective way of saving the result of my Spark job as a CSV file. I'm using Spark with Hadoop, and so far all my files are saved as part-00000.

Any ideas how to make Spark save to a file with a specified file name?

Solution

Since Spark uses the Hadoop FileSystem API to write data to files, this is sort of inevitable. If you do

rdd.saveAsTextFile("foo")

It will be saved as "foo/part-XXXXX", with one part-* file for every partition in the RDD you are trying to save. The reason each partition in the RDD is written to a separate file is fault tolerance. If the task writing the 3rd partition (i.e. part-00002) fails, Spark simply re-runs the task and overwrites the partially written/corrupted part-00002, with no effect on the other parts. If they all wrote to the same file, it would be much harder to recover a single task from failure.
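For illustration, here is a minimal Scala sketch (not from the original answer; the data, RDD name, and output paths are assumptions) that formats records as CSV lines, saves them to see the per-partition part files, and then coalesces to a single partition so only one part-00000 file is produced:

import org.apache.spark.{SparkConf, SparkContext}

object CsvWriteSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("csv-write-sketch").setMaster("local[*]"))

    // Hypothetical data: an RDD of (id, name) pairs split across 3 partitions.
    val rdd = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")), numSlices = 3)

    // Format each record as a CSV line; with 3 partitions this writes
    // foo/part-00000 .. foo/part-00002.
    rdd.map { case (id, name) => s"$id,$name" }.saveAsTextFile("foo")

    // Coalescing to a single partition yields exactly one part file
    // (bar/part-00000), at the cost of funneling all writes through one task.
    rdd.map { case (id, name) => s"$id,$name" }
      .coalesce(1)
      .saveAsTextFile("bar")

    sc.stop()
  }
}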

The part-XXXXX files are usually not a problem if you are going to consume the output again in Spark or other Hadoop-based frameworks, because they all use the HDFS API: if you ask them to read "foo", they will read all the part-XXXXX files inside foo as well.
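If a single file with a specific name is still required, one common workaround (a sketch under assumptions, not part of the answer above; the RDD name csvLines, the temp directory, and the target name result.csv are all hypothetical) is to coalesce to one partition and then rename the resulting part file using the Hadoop FileSystem API:

import org.apache.hadoop.fs.{FileSystem, Path}

// Assumes an existing SparkContext named sc and an RDD[String] named csvLines.
csvLines.coalesce(1).saveAsTextFile("output_tmp")

// Rename the single part file to the desired name, then drop the temp directory.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("output_tmp/part-00000"), new Path("result.csv"))
fs.delete(new Path("output_tmp"), true)

Note that coalescing to one partition gives up parallelism for the final write, so this is only practical when the output is small enough to be written by a single task.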

