Save content of Spark DataFrame as a single CSV file
Problem description
Say I have a Spark DataFrame that I want to save as a CSV file. Since Spark 2.0.0, the DataFrameWriter class directly supports saving it as a CSV file.
The default behavior is to save the output in multiple part-*.csv files inside the provided path.
How can I save the DF so that:
- the path maps to the exact filename instead of a folder,
- the header is available on the first line,
- it is saved as a single file instead of multiple files?
One way to deal with this is to coalesce the DF and then save the file.
df.coalesce(1).write.option("header", "true").csv("sample_file.csv")
However, this has the disadvantage of collecting all the data on a single machine, which therefore needs enough memory to hold it.
Is it possible to write a single CSV file without using coalesce? If not, is there a more efficient way than the above code?
Recommended answer
I just solved this myself using pyspark with dbutils to get the .csv and rename it to the wanted filename.
save_location = "s3a://landing-bucket-test/export/" + year
csv_location = save_location + "temp.folder"
file_location = save_location + "export.csv"

# Write a single part file into a temporary folder.
df.repartition(1).write.csv(path=csv_location, mode="append", header="true")

# Grab the part file (the .csv is usually listed last) and copy it to the final name.
file = dbutils.fs.ls(csv_location)[-1].path
dbutils.fs.cp(file, file_location)

# Clean up the temporary folder.
dbutils.fs.rm(csv_location, recurse=True)
This answer could be improved by not relying on [-1], but the .csv always seems to be last in the folder. It is a simple and fast solution if you only work with smaller files and can use repartition(1) or coalesce(1).
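One way to avoid the [-1] indexing is to select the part file by its name pattern instead of its position. A sketch of that idea, assuming a local filesystem and plain Python rather than dbutils (which works on DBFS/S3 paths); the helper name is hypothetical:

```python
import glob
import os
import shutil

def promote_single_csv(csv_location, file_location):
    """Find the single part-*.csv that Spark wrote and move it to the target filename."""
    parts = glob.glob(os.path.join(csv_location, "part-*.csv"))
    # With repartition(1) or coalesce(1) there should be exactly one part file.
    assert len(parts) == 1, "expected exactly one part file"
    shutil.move(parts[0], file_location)
    # Remove the temporary folder along with _SUCCESS and other marker files.
    shutil.rmtree(csv_location)
```

Matching on part-*.csv never picks up the _SUCCESS marker or .crc files, so it is robust regardless of listing order.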