Is gzip format supported in Spark?


Question

For a Big Data project, I'm planning to use Spark, which has some nice features like in-memory computation for repeated workloads. It can run on local files or on top of HDFS.

However, in the official documentation, I can't find any hint as to how to process gzipped files. In practice, it can be quite efficient to process .gz files instead of unzipped files.

Is there a way to manually implement reading of gzipped files or is unzipping already automatically done when reading a .gz file?

Answer

From the Spark Scala Programming guide's section on "Hadoop Datasets":

Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by Hadoop (including your local file system, Amazon S3, Hypertable, HBase, etc). Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

Support for gzip input files should work the same as it does in Hadoop. For example, sc.textFile("myFile.gz") should automatically decompress and read gzip-compressed files (textFile() is actually implemented using Hadoop's TextInputFormat, which supports gzip-compressed files).
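Because textFile() delegates to Hadoop's TextInputFormat, reading a .gz file amounts to transparently decompressing the stream and splitting the text into lines. A minimal sketch of that equivalent behavior using only Python's standard gzip module (the file name is hypothetical; this illustrates the decompression Spark performs for you, not Spark's own API):

```python
import gzip

# Write a small gzip-compressed text file, then read it back line by line.
# This mirrors what sc.textFile("myFile.gz") does under the hood:
# decompress the stream, then split the decoded text into lines.
with gzip.open("myFile.gz", "wt", encoding="utf-8") as f:
    f.write("line one\nline two\nline three\n")

with gzip.open("myFile.gz", "rt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print(lines)  # ['line one', 'line two', 'line three']
```

No extra configuration is needed on the Spark side: the codec is chosen from the file extension, just as in plain Hadoop jobs.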

As mentioned by @nick-chammas in the comments:

Note that if you call sc.textFile() on a gzipped file, Spark will give you an RDD with only one partition (as of 0.9.0). This is because gzipped files are not splittable. If you don't repartition the RDD somehow, any operations on that RDD will be limited to a single core.
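The reason gzip is not splittable is that a DEFLATE stream can only be decompressed from its start: there is no valid gzip header at an arbitrary byte offset, so Hadoop cannot hand the middle of the file to a second task. A small standard-library sketch illustrating this (the payload is made up for the demo):

```python
import gzip
import zlib

data = gzip.compress(b"some text that Spark would like to split across tasks\n" * 100)

# Decompressing from the beginning of the stream works fine.
assert gzip.decompress(data).startswith(b"some text")

# Decompressing from an arbitrary midpoint fails: there is no gzip
# header there, so a reader cannot start in the middle of the file.
try:
    gzip.decompress(data[len(data) // 2:])
    split_ok = True
except (OSError, EOFError, zlib.error):
    split_ok = False

print(split_ok)  # False
```

Because of this, a common pattern is to repartition right after loading (e.g. sc.textFile("myFile.gz").repartition(n)) so later stages can use more than one core; alternatives are a splittable codec such as bzip2, or splitting the data into many smaller .gz files.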
