Specifying the filename when saving a DataFrame as a CSV


Question

Say I have a Spark DataFrame that I want to save to disk as a CSV file. In Spark 2.0.0+, one can obtain a DataFrameWriter from a DataFrame (Dataset[Row]) via .write and use its .csv method to write the file.

The function is defined as

def csv(path: String): Unit
    path : the location/folder name and not the file name.

Spark stores the CSV output at the specified location by creating files named part-*.csv.

Is there a way to save the CSV with a specified filename instead of part-*.csv? Or is it possible to specify a prefix to use instead of part-r?

Code:

df.coalesce(1).write.csv("sample_path")

Current output:

sample_path
|
+-- part-r-00000.csv

Desired output:

sample_path
|
+-- my_file.csv

Note: coalesce is used to output a single file, and the executor has enough memory to collect the DataFrame without a memory error.

Recommended answer

It's not possible to do this directly with Spark's save.

Spark uses the Hadoop file format, which requires data to be partitioned; that's why you get part- files. You can easily change the filename after processing, just like in this question.

In Scala it will look like:

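import org.apache.hadoop.fs._

// sc is the active SparkContext; the CSV was written to the directory "mydata.csv-temp"
val fs = FileSystem.get(sc.hadoopConfiguration)
// Name of the first (and, after coalesce(1), only) part file in that directory
val file = fs.globStatus(new Path("mydata.csv-temp/part*"))(0).getPath().getName()

// Move the part file to its final name, then remove the temporary directory
fs.rename(new Path("mydata.csv-temp/" + file), new Path("mydata.csv"))
fs.delete(new Path("mydata.csv-temp"), true)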

Or simply:

import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("csvDirectory/data.csv/part-0000"), new Path("csvDirectory/newData.csv"))
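The write-then-rename steps can be wrapped into one helper. The following is only a sketch under stated assumptions: `writeSingleCsv` is a hypothetical name (it is not part of Spark), the "-temp" suffix for the staging directory is an arbitrary choice, and the DataFrame is assumed to be small enough to fit in a single partition.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not a Spark API): writes `df` as a single CSV file at `target`.
def writeSingleCsv(df: DataFrame, target: String): Unit = {
  val targetPath = new Path(target)
  val tempDir = new Path(target + "-temp") // staging directory; the suffix is an assumption
  val conf = df.sparkSession.sparkContext.hadoopConfiguration
  val fs: FileSystem = targetPath.getFileSystem(conf)

  // coalesce(1) reduces the data to one partition, so at most one part file is written.
  df.coalesce(1).write.mode("overwrite").csv(tempDir.toString)

  // Locate the part file by glob; its exact name differs between Spark versions.
  val parts = fs.globStatus(new Path(tempDir, "part-*"))
  require(parts != null && parts.length == 1, s"expected exactly one part file in $tempDir")

  // Remove a stale target file if present, move the part file into place,
  // then drop the staging directory (including _SUCCESS and similar marker files).
  fs.delete(targetPath, false)
  val renamed = fs.rename(parts(0).getPath, targetPath)
  fs.delete(tempDir, true)
  require(renamed, s"could not rename ${parts(0).getPath} to $targetPath")
}
```

A call such as `writeSingleCsv(df, "sample_path/my_file.csv")` would then leave my_file.csv in place of the part file. Resolving the file system from the target path rather than from the default configuration lets the same code work for local paths and for HDFS URIs.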

As mentioned in the comments, you can also write your own OutputFormat; see the documentation for how to set the file name with this approach.
