Save content of Spark DataFrame as a single CSV file


Problem Description

Say I have a Spark DataFrame which I want to save as a CSV file. After Spark 2.0.0, the DataFrameWriter class directly supports saving it as a CSV file.

The default behavior is to save the output in multiple part-*.csv files inside the path provided.
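For example (a minimal sketch; df is assumed to be an existing DataFrame and the output path is only a placeholder), the stock writer produces a directory of part files:

# Writes a directory containing part-*.csv files (plus a _SUCCESS marker),
# not a single CSV file. The path below is purely illustrative.
df.write.option("header", "true").csv("s3a://some-bucket/output_dir")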

How can I save the DF with:

  1. Path mapping to the exact file name instead of a folder
  2. Header available in the first line
  3. Saved as a single file instead of multiple files.

One way to deal with it is to coalesce the DF and then save the file.

df.coalesce(1).write.option("header", "true").csv("sample_file.csv")

However, this has the disadvantage of collecting the whole DataFrame on a single machine, which needs enough memory to hold it.

Is it possible to write a single CSV file without using coalesce? If not, is there a more efficient way than the above code?

Recommended Answer

Just solved this myself using pyspark with dbutils to get the .csv and rename it to the wanted filename.

# `year` is assumed to be defined earlier, e.g. year = "2020"
save_location = "s3a://landing-bucket-test/export/" + year
csv_location = save_location + "temp.folder"
file_location = save_location + "export.csv"

# Write the DataFrame as a single part file into a temporary folder
df.repartition(1).write.csv(path=csv_location, mode="append", header="true")

# Copy the part file to the wanted filename, then remove the temporary folder
file = dbutils.fs.ls(csv_location)[-1].path
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(csv_location, recurse=True)

This answer could be improved by not relying on [-1], but the .csv seems to always be listed last in the folder. It is a simple and fast solution if you only work with smaller files and can use repartition(1) or coalesce(1).
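For environments without dbutils (it is Databricks-specific), the same rename trick can be sketched with the Hadoop FileSystem API through the JVM gateway. This is only an illustrative sketch, not the answer's method: it assumes an active SparkSession named spark and reuses the csv_location and file_location variables from the snippet above.

# Write a single part file into the temporary folder, as before
df.repartition(1).write.csv(path=csv_location, mode="append", header="true")

# Obtain the Hadoop FileSystem that backs the output path
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.Path(csv_location).getFileSystem(spark._jsc.hadoopConfiguration())

# Move the part file to the wanted name, then drop the temporary folder
for status in fs.listStatus(hadoop.fs.Path(csv_location)):
    name = status.getPath().getName()
    if name.startswith("part-") and name.endswith(".csv"):
        fs.rename(status.getPath(), hadoop.fs.Path(file_location))
fs.delete(hadoop.fs.Path(csv_location), True)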
