How to read ".gz" compressed file using Spark DF or DS?
Question
I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset?
Details: the file is a CSV with tab delimiters.
Answer
Reading a compressed CSV is done the same way as reading an uncompressed CSV file. For Spark 2.0+, it can be done as follows in Scala (note the extra option for the tab delimiter):
val df = spark.read.option("sep", "\t").csv("file.csv.gz")
PySpark:
df = spark.read.csv("file.csv.gz", sep='\t')
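The reason Spark can read this transparently is that a ".csv.gz" file is simply a gzip-wrapped text file. A minimal pure-stdlib Python sketch (independent of Spark; the file name and column values are made up for illustration) shows the same round trip:

```python
import csv
import gzip
import os
import tempfile

# Hypothetical path for the demo; any writable location works.
path = os.path.join(tempfile.mkdtemp(), "file.csv.gz")

# Write a tab-delimited CSV directly into a gzip stream.
with gzip.open(path, "wt", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["id", "name"])
    writer.writerow(["1", "alice"])

# Read it back, decompressing on the fly.
with gzip.open(path, "rt", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))

print(rows)  # [['id', 'name'], ['1', 'alice']]
```

Spark does the equivalent of this decompress-then-parse step for each .gz input file, which is why no extra option beyond the delimiter is needed.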
The only extra consideration is that a gz file is not splittable, so Spark must read the whole file on a single core, which slows things down. After the read is done, the data can be shuffled to increase parallelism.
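The non-splittable point can be seen without Spark: a gzip stream must be decompressed from the beginning, so two tasks cannot start reading at different offsets of the same .gz file. A small pure-stdlib Python sketch:

```python
import gzip
import zlib

# Build a gzip-compressed tab-delimited payload in memory.
data = ("col1\tcol2\n" * 10_000).encode()
compressed = gzip.compress(data)

# Decompressing from the start works.
assert gzip.decompress(compressed) == data

# Starting from the middle fails: there is no valid gzip header
# there, which is why Spark cannot split the file across tasks.
try:
    gzip.decompress(compressed[len(compressed) // 2:])
    splittable = True
except (OSError, EOFError, zlib.error):
    splittable = False

print(splittable)  # False
```

Once the DataFrame is loaded, calling `df.repartition(n)` triggers the shuffle mentioned above, spreading the rows over n partitions for downstream work.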