Write a single CSV file using spark-csv
Question
I am using https://github.com/databricks/spark-csv and I am trying to write a single CSV file, but I am not able to: it creates a folder instead.
I need a Scala function that takes a path and a file name as parameters and writes the CSV to that file.
Answer
It is creating a folder with multiple files because each partition is saved individually. If you need a single output file (still inside a folder), you can repartition the data frame before saving:
df
  // place all data in a single partition
  .coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("mydata.csv")
All data will be written to mydata.csv/part-00000. Before you use this option, be sure you understand what is going on and what the cost of transferring all data to a single worker is. If you use a distributed file system with replication, the data will be transferred multiple times: first fetched to a single worker and subsequently distributed over the storage nodes.
Alternatively, you can leave your code as it is and use general-purpose tools like cat or HDFS getmerge (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#getmerge) to simply merge all the parts afterwards.
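As a minimal sketch of the cat approach (the part files and their contents below are fabricated for illustration; a real Spark output folder also contains `_SUCCESS` and hidden `.crc` files, which the `part-*` glob deliberately skips):

```shell
# Hypothetical layout: pretend Spark wrote its output parts into mydata.csv/
mkdir -p mydata.csv
printf 'id,name\n1,alice\n' > mydata.csv/part-00000
printf '2,bob\n'            > mydata.csv/part-00001

# Concatenate the parts, in name order, into a single CSV file.
# Note: if every part carried its own header row, you would need to strip
# the repeated headers; here only part-00000 has one.
cat mydata.csv/part-* > merged.csv
```

On HDFS, the equivalent is `hdfs dfs -getmerge mydata.csv merged.csv`, which concatenates the files in the source directory into one local file.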