How to read a ".gz" compressed file using Spark DF or DS?


Problem description

I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset?

Details: the file is a tab-delimited CSV.

Recommended answer

Reading a compressed CSV is done the same way as reading an uncompressed CSV file: Spark detects the gzip codec from the file extension and decompresses transparently. For Spark 2.0+ it can be done as follows in Scala (note the extra option for the tab delimiter):

val df = spark.read.option("sep", "\t").csv("file.csv.gz")

PySpark:

df = spark.read.csv("file.csv.gz", sep='\t')
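To see what Spark is handed in this case, here is a quick local check of the tab-delimited gzip format using only the Python standard library (no Spark required; the sample rows are illustrative):

```python
import csv
import gzip
import io

# Write a small tab-delimited CSV and gzip-compress it in memory,
# matching the on-disk format of a file.csv.gz.
rows = [["id", "name"], ["1", "alice"], ["2", "bob"]]
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    text = "\n".join("\t".join(r) for r in rows) + "\n"
    gz.write(text.encode("utf-8"))

# Read it back: gzip decompresses transparently, and what remains
# is ordinary tab-separated text, parsed here with csv.reader.
buf.seek(0)
with gzip.open(buf, mode="rt", encoding="utf-8", newline="") as gz:
    parsed = list(csv.reader(gz, delimiter="\t"))

print(parsed)  # → [['id', 'name'], ['1', 'alice'], ['2', 'bob']]
```

This is exactly the layering Spark relies on: the codec handles decompression, and the CSV parser only ever sees plain tab-delimited lines.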

The only extra consideration is that a gz file is not splittable, so Spark must read the whole file on a single core, which slows things down. Once the read is done, the data can be shuffled to increase parallelism.
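The read-then-redistribute idea can be sketched without Spark: one sequential pass over the gzip stream stands in for the single-core read, and splitting the rows into chunks stands in for the shuffle. The data and partition count here are illustrative, not Spark API:

```python
import gzip
import io

# Build an in-memory gzip file of tab-separated rows.
data = "\n".join(f"{i}\tvalue{i}" for i in range(10)) + "\n"
buf = io.BytesIO(gzip.compress(data.encode("utf-8")))

# Step 1: a gzip stream cannot be split, so the whole file is
# consumed in one sequential pass (Spark's single-core read).
with gzip.open(buf, mode="rt", encoding="utf-8") as gz:
    rows = [line.rstrip("\n").split("\t") for line in gz]

# Step 2: after the read, redistribute the rows into partitions so
# later stages can work in parallel (what a Spark shuffle achieves).
num_partitions = 4
partitions = [rows[i::num_partitions] for i in range(num_partitions)]

print(len(rows), [len(p) for p in partitions])  # → 10 [3, 3, 2, 2]
```

In PySpark the second step is a one-liner on the DataFrame itself, e.g. `spark.read.csv("file.csv.gz", sep='\t').repartition(8)`, at the cost of one shuffle of the data.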

