Specifying the filename when saving a DataFrame as a CSV
Question
Say I have a Spark DataFrame that I want to save to disk as a CSV file. In Spark 2.0.0+, one can obtain a DataFrameWriter from a DataFrame (Dataset[Row]) through .write and use its .csv method to write the file.
The function is defined as
def csv(path: String): Unit
path : the location/folder name and not the file name.
Spark stores the CSV output at the specified location by creating files named part-*.csv inside it.
Is there a way to save the CSV with a specified filename instead of part-*.csv? Or is it possible to specify a prefix to use instead of part-r?
Code:
df.coalesce(1).write.csv("sample_path")
Current output:
sample_path
|
+-- part-r-00000.csv
Desired output:
sample_path
|
+-- my_file.csv
Note: the coalesce function is used to output a single file, and the executor has enough memory to collect the DataFrame without a memory error.
Recommended answer
It is not possible to do this directly in Spark's save. Spark uses the Hadoop file format, which requires data to be partitioned, and that is why you get part- files. You can easily rename the file after the write has finished.
In Scala it will look like:
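import org.apache.hadoop.fs._
// Assumes the DataFrame was written with df.coalesce(1).write.csv("mydata.csv-temp")
val fs = FileSystem.get(sc.hadoopConfiguration)
// Look up the name of the single part file in the temporary output directory
val file = fs.globStatus(new Path("mydata.csv-temp/part*"))(0).getPath().getName()
// Move the part file to the desired file name
fs.rename(new Path("mydata.csv-temp/" + file), new Path("mydata.csv"))
// Remove the temporary directory and whatever is left in it
fs.delete(new Path("mydata.csv-temp"), true)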
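Or simply:
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
// Works only when the exact part file name is known in advance; adjust it to match your output
fs.rename(new Path("csvDirectory/data.csv/part-0000"), new Path("csvDirectory/newData.csv"))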
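The write-then-rename steps can also be wrapped in a small helper. The following is only a sketch under a few assumptions: writeSingleCsv is a hypothetical name and not part of Spark, the "-temp" suffix for the intermediate directory is an arbitrary choice, and the DataFrame is assumed to be small enough for coalesce(1). It uses only the Hadoop FileSystem calls shown above (globStatus, rename, delete).

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not a Spark API): writes df as a single CSV file at `target`.
def writeSingleCsv(df: DataFrame, target: String): Unit = {
  val tmpDir = new Path(target + "-temp") // intermediate directory; the name is an assumption
  val dst    = new Path(target)
  val fs     = tmpDir.getFileSystem(df.sparkSession.sparkContext.hadoopConfiguration)

  // Spark always writes a directory; coalesce(1) leaves one part file in it.
  df.coalesce(1).write.mode("overwrite").csv(tmpDir.toString)

  // globStatus can return null or an empty array when nothing matches.
  val part = Option(fs.globStatus(new Path(tmpDir, "part-*")))
    .flatMap(_.headOption)
    .map(_.getPath)
    .getOrElse(sys.error(s"No part file found in $tmpDir"))

  // rename reports failure through its Boolean result rather than an exception.
  if (!fs.rename(part, dst)) sys.error(s"Could not rename $part to $dst")

  // Clean up the intermediate directory, including the _SUCCESS marker.
  fs.delete(tmpDir, true)
}
```

With this in scope, writeSingleCsv(df, "sample_path/my_file.csv") would be expected to produce the layout from the question. Whether rename succeeds when the target already exists depends on the underlying file system, so an existing my_file.csv may need to be removed first.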
As mentioned in the comments, you can also write your own OutputFormat; see the documentation for information about this approach to setting the file name.