Spark: writing DataFrame as compressed JSON
Question
Apache Spark's DataFrameReader.json() can handle gzipped JSON Lines files automatically, but there doesn't seem to be a way to get DataFrameWriter.json() to write compressed JSON Lines files. The extra network I/O is very expensive in the cloud.
Is there a way around this problem?
Answer
The following solutions use pyspark, but I assume the code in Scala would be similar.
The first option is to set the following when you initialise your SparkConf:
conf = SparkConf()
conf.set("spark.hadoop.mapred.output.compress", "true")
conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
conf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")
With the code above, any file you produce using that SparkContext is automatically compressed using gzip.
The second option is for when you want to compress only selected files within your context. Let's say df is your DataFrame and filename your destination:
df_rdd = df.toJSON()
df_rdd.saveAsTextFile(filename, compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")