How to save a spark DataFrame as csv on disk?
Problem Description
Say, for instance, the result of this:
df.filter("project = 'en'").select("title","count").groupBy("title").sum()
will return an array.
How to save a spark DataFrame as a csv file on disk?
Recommended Answer
Before Spark 2.x, Apache Spark did not support native CSV output on disk.
You have four available solutions though:
1. You can convert your DataFrame into an RDD:
def convertToReadableString(r: Row): String = ???
df.rdd.map(convertToReadableString).saveAsTextFile(filepath)
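The ??? is left for you to fill in; a minimal sketch of such a helper (hypothetical, assuming the Row holds plain scalar fields) could be:

import org.apache.spark.sql.Row

// Hypothetical helper: render every field of the Row and join with commas.
// Null fields become empty strings; embedded commas and quotes are NOT
// escaped, so this is not a fully compliant CSV encoder.
def convertToReadableString(r: Row): String =
  r.toSeq.map(v => if (v == null) "" else v.toString).mkString(",")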
This will create a folder at filepath. Under that path, you'll find the partition files (e.g. part-000*).
What I usually do if I want to append all the partitions into a big CSV is:
cat filepath/part* > mycsvfile.csv
Some will use coalesce(1, false) to create one partition from the RDD. It's usually bad practice, since it may overwhelm the driver by pulling all the data you are collecting into it.
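For completeness, a sketch of that pattern (reusing the hypothetical convertToReadableString above):

// Collapse the RDD to a single partition so only one part- file is written.
// Caution: every record then flows through a single task, which can easily
// overwhelm one executor (or the driver) on large datasets.
df.rdd
  .coalesce(1, shuffle = false)
  .map(convertToReadableString)
  .saveAsTextFile(filepath)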
Note that df.rdd will return an RDD[Row].
2. With Spark < 2, you can use the databricks spark-csv library:
Spark 1.4+:
df.write.format("com.databricks.spark.csv").save(filepath)
Spark 1.3:
df.save(filepath, "com.databricks.spark.csv")
3. With Spark 2.x the spark-csv package is not needed, as it's included in Spark:
df.write.format("csv").save(filepath)
4. You can convert to a local Pandas data frame and use the to_csv method (PySpark only).
Note: solutions 1, 2 and 3 will result in CSV format files (part-*) generated by the underlying Hadoop API that Spark calls when you invoke save. You will have one part- file per partition.