Spark: writing DataFrame as compressed JSON
Question
Apache Spark's DataFrameReader.json() can handle gzipped JSON Lines files automatically, but there doesn't seem to be a way to get DataFrameWriter.json() to write compressed JSON Lines files. The extra network I/O is very expensive in the cloud.
Is there a way around this problem?
Answer
The following solutions use pyspark, but I assume the code in Scala would be similar.
The first option is to set the following when you initialise your SparkConf:
conf = SparkConf()
conf.set("spark.hadoop.mapred.output.compress", "true")
conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
conf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")
With the code above, any file you produce using that SparkContext is automatically compressed using gzip.
The second option is for when you want to compress only selected files within your context. Let's say df is your DataFrame and filename your destination:
df_rdd = df.toJSON()
df_rdd.saveAsTextFile(filename, compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")