Spark: writing a DataFrame as compressed JSON

Published: 2021/11/14 21:50:43 Tags: apache-spark compression gzip dataframe apache-spark-sql

Question

Apache Spark's DataFrameReader.json() handles gzipped JSONlines files automatically, but there doesn't seem to be a way to get DataFrameWriter.json() to write compressed JSONlines files. Writing uncompressed output means extra network I/O, which is very expensive in the cloud.

Is there a way around this problem?

Recommended answer

The following solutions use pyspark, but I assume the code in Scala would be similar.

The first option is to set the following when you initialise your SparkConf:

from pyspark import SparkConf

conf = SparkConf()
# Compress Hadoop output files, using gzip as the codec
conf.set("spark.hadoop.mapred.output.compress", "true")
conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
conf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")

With the configuration above, any file you write through a SparkContext created from that conf is automatically compressed with gzip.
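The same settings can also be passed through SparkSession.builder, which is the usual entry point when working with DataFrames. The following is only a sketch: the helper names with_gzip_output and build_session and the application name are hypothetical, and whether DataFrameWriter honours these Hadoop settings may depend on the Spark version, so treat that as an assumption to verify.

```python
# Hadoop output-compression settings from the answer. The "spark.hadoop."
# prefix makes Spark copy them into the Hadoop configuration.
GZIP_OUTPUT_SETTINGS = {
    "spark.hadoop.mapred.output.compress": "true",
    "spark.hadoop.mapred.output.compression.codec":
        "org.apache.hadoop.io.compress.GzipCodec",
    "spark.hadoop.mapred.output.compression.type": "BLOCK",
}


def with_gzip_output(builder):
    """Apply the settings to a SparkSession.builder-like object and return it."""
    for key, value in GZIP_OUTPUT_SETTINGS.items():
        builder = builder.config(key, value)
    return builder


def build_session(app_name="gzip-json-demo"):
    """Create a session configured with the gzip output settings (hypothetical helper)."""
    # Imported here so the helpers above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return with_gzip_output(SparkSession.builder.appName(app_name)).getOrCreate()
```

Calling build_session() once at start-up and reusing the returned session keeps the compression choice in one place instead of repeating it at every write.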

The second option applies if you want to compress only selected files within your context. Let's say "df" is your DataFrame and "filename" is your destination path:

# toJSON() returns an RDD of JSON strings, one per row
df_rdd = df.toJSON()
df_rdd.saveAsTextFile(filename, compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
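
On Spark 2.0 and later, DataFrameWriter.json() also accepts a compression option, which may make the RDD detour unnecessary. Below is a minimal sketch, assuming "df" is an existing DataFrame and the destination is a local directory. The helper names write_gzipped_json and read_gzipped_jsonlines are hypothetical, and the reader uses only the Python standard library to check what was written:

```python
import glob
import gzip
import json
import os


def write_gzipped_json(df, path):
    """Write a DataFrame as gzip-compressed JSON Lines part files under `path`."""
    df.write.json(path, compression="gzip")


def read_gzipped_jsonlines(path):
    """Yield one dict per line from every part-*.gz file in a local directory."""
    for part in sorted(glob.glob(os.path.join(path, "part-*.gz"))):
        with gzip.open(part, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    yield json.loads(line)
```

The reader works on the output of either approach, since both produce part files ending in .gz. For output on a remote store it is probably simpler to read the directory back with DataFrameReader.json(), which decompresses gzip automatically.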
