How to save a DataFrame as compressed (gzipped) CSV?
Problem description
I use Spark 1.6.0 and Scala.

I want to save a DataFrame as compressed CSV format.

Here is what I have so far (assume I already have df and sc as SparkContext):
//set the conf to the codec I want
sc.getConf.set("spark.hadoop.mapred.output.compress", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
sc.getConf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")
df.write
  .format("com.databricks.spark.csv")
  .save(my_directory)
The output is not in gz format.
Recommended answer
Spark 2.2+
df.write.option("compression","gzip").csv("path")
Spark 2.0
df.write.csv("path", compression="gzip")
Spark 1.6
On the spark-csv github: https://github.com/databricks/spark-csv
one can read:
codec: compression codec to use when saving to file. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of the case-insensitive short names (bzip2, gzip, lz4, and snappy). Defaults to no compression when a codec is not specified.
In this case, this works:

df.write.format("com.databricks.spark.csv").codec("gzip")\
  .save('my_directory/my_file.gzip')