Write single CSV file using spark-csv


Problem description

I am using https://github.com/databricks/spark-csv and I am trying to write a single CSV file, but I am not able to: it keeps creating a folder.

I need a Scala function which takes a path and a file name as parameters and writes out that CSV file.

Recommended answer

It is creating a folder with multiple files because each partition is saved individually. If you need a single output file (still inside a folder), you can repartition the DataFrame before saving:

df
  // place all data in a single partition
  .coalesce(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("mydata.csv")

All data will be written to mydata.csv/part-00000. Before you use this option, be sure you understand what is going on and what the cost of transferring all data to a single worker is. If you use a distributed file system with replication, the data will be transferred multiple times: first fetched to a single worker and subsequently distributed over the storage nodes.
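If you also need control over the final file name, as the question asks, a common pattern is to write into a temporary folder with coalesce(1) and then rename the single part file using the Hadoop FileSystem API. Below is a minimal sketch, assuming a Hadoop-compatible file system; the helper name writeSingleCsv and the temporary-folder naming are illustrative, not part of spark-csv:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: write `df` as a single CSV file at path/fileName.
def writeSingleCsv(df: org.apache.spark.sql.DataFrame,
                   path: String, fileName: String): Unit = {
  val tmpDir = path + "/_tmp_" + fileName   // temporary output folder (illustrative)
  df.coalesce(1)                            // single partition -> single part file
    .write.format("com.databricks.spark.csv")
    .option("header", "true")
    .save(tmpDir)

  val fs = FileSystem.get(df.rdd.sparkContext.hadoopConfiguration)
  // coalesce(1) guarantees exactly one part-* file in the folder.
  val partFile = fs.globStatus(new Path(tmpDir + "/part-*"))(0).getPath
  fs.rename(partFile, new Path(path + "/" + fileName))
  fs.delete(new Path(tmpDir), true)         // clean up the temporary folder
}

This still pays the same cost described above, since all data passes through one worker; the rename only fixes the file name.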

Alternatively, you can leave your code as it is and use general-purpose tools like cat or HDFS getmerge (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#getmerge) to simply merge all the parts afterwards.
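As a sketch of the getmerge approach from Scala, Hadoop 2.x exposes the same logic as FileUtil.copyMerge (note this method was removed in Hadoop 3); the destination name mydata-merged.csv is just an example:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Merge every part file under the source folder into one destination file.
val conf = new Configuration()
val fs = FileSystem.get(conf)
FileUtil.copyMerge(
  fs, new Path("mydata.csv"),        // folder produced by df.write...save(...)
  fs, new Path("mydata-merged.csv"), // single merged output file (example name)
  false,                             // deleteSource: keep the original folder
  conf,
  null)                              // no separator string between parts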
