Save content of Spark DataFrame as a single CSV file


Problem Description

Say I have a Spark DataFrame which I want to save as a CSV file. After Spark 2.0.0, the DataFrameWriter class directly supports saving it as a CSV file.

The default behavior is to save the output in multiple part-*.csv files inside the path provided.
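For example (a minimal sketch; df is assumed to be an existing DataFrame and the output path is only a placeholder), the stock writer produces a directory of part files:

# Writes a directory containing part-*.csv files (plus a _SUCCESS marker),
# not a single CSV file. The path below is purely illustrative.
df.write.option("header", "true").csv("s3a://some-bucket/output_dir")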

How can I save the DF with:

  1. Path mapping to the exact file name instead of a folder
  2. Header available in the first line
  3. Saved as a single file instead of multiple files.

One way to deal with it is to coalesce the DF and then save the file.

df.coalesce(1).write.option("header", "true").csv("sample_file.csv")

However, this has the disadvantage of collecting the whole DataFrame on a single machine, which needs enough memory to hold it.

Is it possible to write a single CSV file without using coalesce? If not, is there a more efficient way than the above code?

Recommended Answer

Just solved this myself using pyspark with dbutils to get the .csv and rename it to the wanted filename.

# `year` is assumed to be defined earlier, e.g. year = "2020"
save_location = "s3a://landing-bucket-test/export/" + year
csv_location = save_location + "temp.folder"
file_location = save_location + "export.csv"

# Write the DataFrame as a single part file into a temporary folder
df.repartition(1).write.csv(path=csv_location, mode="append", header="true")

# Copy the part file to the wanted filename, then remove the temporary folder
file = dbutils.fs.ls(csv_location)[-1].path
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(csv_location, recurse=True)

This answer could be improved by not relying on [-1], but the .csv seems to always be listed last in the folder. It is a simple and fast solution if you only work with smaller files and can use repartition(1) or coalesce(1).
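For environments without dbutils (it is Databricks-specific), the same rename trick can be sketched with the Hadoop FileSystem API through the JVM gateway. This is only an illustrative sketch, not the answer's method: it assumes an active SparkSession named spark and reuses the csv_location and file_location variables from the snippet above.

# Write a single part file into the temporary folder, as before
df.repartition(1).write.csv(path=csv_location, mode="append", header="true")

# Obtain the Hadoop FileSystem that backs the output path
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.Path(csv_location).getFileSystem(spark._jsc.hadoopConfiguration())

# Move the part file to the wanted name, then drop the temporary folder
for status in fs.listStatus(hadoop.fs.Path(csv_location)):
    name = status.getPath().getName()
    if name.startswith("part-") and name.endswith(".csv"):
        fs.rename(status.getPath(), hadoop.fs.Path(file_location))
fs.delete(hadoop.fs.Path(csv_location), True)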
