Write single CSV file using spark-csv


Problem description

I am using https://github.com/databricks/spark-csv and I am trying to write a single CSV file, but I am not able to: it keeps creating a folder.

I need a Scala function which takes a path and a file name as parameters and writes out that CSV file.

Recommended answer

It is creating a folder with multiple files because each partition is saved individually. If you need a single output file (still inside a folder), you can repartition the DataFrame before saving:

df
  // place all data in a single partition
  .coalesce(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("mydata.csv")

All data will be written to mydata.csv/part-00000. Before you use this option, be sure you understand what is going on and what the cost of transferring all data to a single worker is. If you use a distributed file system with replication, the data will be transferred multiple times: first fetched to a single worker and subsequently distributed over the storage nodes.
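If you also need control over the final file name, as the question asks, a common pattern is to write into a temporary folder with coalesce(1) and then rename the single part file using the Hadoop FileSystem API. Below is a minimal sketch, assuming a Hadoop-compatible file system; the helper name writeSingleCsv and the temporary-folder naming are illustrative, not part of spark-csv:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: write `df` as a single CSV file at path/fileName.
def writeSingleCsv(df: org.apache.spark.sql.DataFrame,
                   path: String, fileName: String): Unit = {
  val tmpDir = path + "/_tmp_" + fileName   // temporary output folder (illustrative)
  df.coalesce(1)                            // single partition -> single part file
    .write.format("com.databricks.spark.csv")
    .option("header", "true")
    .save(tmpDir)

  val fs = FileSystem.get(df.rdd.sparkContext.hadoopConfiguration)
  // coalesce(1) guarantees exactly one part-* file in the folder.
  val partFile = fs.globStatus(new Path(tmpDir + "/part-*"))(0).getPath
  fs.rename(partFile, new Path(path + "/" + fileName))
  fs.delete(new Path(tmpDir), true)         // clean up the temporary folder
}

This still pays the same cost described above, since all data passes through one worker; the rename only fixes the file name.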

Alternatively, you can leave your code as it is and use general-purpose tools like cat or HDFS getmerge (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#getmerge) to simply merge all the parts afterwards.
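As a sketch of the getmerge approach from Scala, Hadoop 2.x exposes the same logic as FileUtil.copyMerge (note this method was removed in Hadoop 3); the destination name mydata-merged.csv is just an example:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Merge every part file under the source folder into one destination file.
val conf = new Configuration()
val fs = FileSystem.get(conf)
FileUtil.copyMerge(
  fs, new Path("mydata.csv"),        // folder produced by df.write...save(...)
  fs, new Path("mydata-merged.csv"), // single merged output file (example name)
  false,                             // deleteSource: keep the original folder
  conf,
  null)                              // no separator string between parts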
