How to save a Spark RDD in gzip format through PySpark

Views: 596 Published: 2016/5/22 15:44:24 Tags: python apache-spark pyspark


I'm saving a Spark RDD to an S3 bucket using the following code. Is there a way to compress the output (in gzip format) instead of saving it as plain text files?

help_data.repartition(5).saveAsTextFile("s3://help-test/logs/help")

Solution

The saveAsTextFile method takes an optional argument, compressionCodecClass, which specifies the compression codec class:

help_data.repartition(5).saveAsTextFile(
    path="s3://help-test/logs/help",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)
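To spot-check the result, one option is to copy the output directory to a local disk and read the compressed parts back with Python's standard library. The sketch below assumes the parts follow the usual part-00000.gz naming and contain UTF-8 text; the helper name read_gzip_parts and the local directory are hypothetical, not part of the original answer.

```python
import glob
import gzip
import os


def read_gzip_parts(output_dir):
    """Return all lines from the part-*.gz files in a local copy of the output.

    Hypothetical helper: assumes Spark named the files part-00000.gz,
    part-00001.gz, ... and that they hold UTF-8 text.
    """
    lines = []
    # Sort so the parts are read in partition order.
    for part in sorted(glob.glob(os.path.join(output_dir, "part-*.gz"))):
        # Mode "rt" decompresses and decodes to text in one step.
        with gzip.open(part, "rt", encoding="utf-8") as fh:
            lines.extend(line.rstrip("\n") for line in fh)
    return lines
```

Within Spark itself no such helper is needed: sc.textFile decompresses .gz files transparently when pointed at the same path. Note that gzip is not a splittable format, so each part file is read by a single task.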
