How to read multiple gzipped files from S3 into a single RDD?
Question
I have many gzipped files stored on S3, organized by project and by hour per day. The file paths follow this pattern:
s3://<bucket>/project1/20141201/logtype1/logtype1.0000.gz
s3://<bucket>/project1/20141201/logtype1/logtype1.0100.gz
....
s3://<bucket>/project1/20141201/logtype1/logtype1.2300.gz
Since the data should be analyzed on a daily basis, I have to download and decompress the files belonging to a specific day, then assemble the contents into a single RDD.
There are probably several ways to do this, but I would like to know the best practice for Spark.
Thanks.
Answer
The underlying Hadoop API that Spark uses to access S3 allows you to specify input files using a glob expression.
From the Spark documentation:
All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
So in your case you should be able to open all those files as a single RDD using something like this:
rdd = sc.textFile("s3://bucket/project1/20141201/logtype1/logtype1.*.gz")
Just for the record, you can also specify files using a comma-delimited list, and you can even mix that with the * and ? wildcards.
For example:
rdd = sc.textFile("s3://bucket/201412??/*/*.gz,s3://bucket/random-file.txt")
Briefly, what this does is:
- The * matches all strings, so in this case all gz files in all folders under 201412?? will be loaded.
- The ? matches a single character, so 201412?? will cover all days in December 2014 like 20141201, 20141202, and so forth.
- The , lets you load separate files at once into the same RDD, like the random-file.txt in this case.
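Since the files are grouped per day, you will usually want to build that glob from a date rather than hard-code it. The helper below is a hypothetical sketch of my own (day_glob is not a Spark API); it assumes an existing SparkContext sc, as in the snippets above, and reproduces the path layout from the question.

from datetime import date

def day_glob(bucket, project, logtype, day):
    # Build the glob covering all hourly .gz files for one day.
    return "s3://{0}/{1}/{2}/{3}/{3}.*.gz".format(
        bucket, project, day.strftime("%Y%m%d"), logtype)

rdd = sc.textFile(day_glob("bucket", "project1", "logtype1", date(2014, 12, 1)))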
Some notes about the appropriate URL scheme for S3 paths:
- If you're running Spark on EMR, the correct URL scheme is s3://.
- If you're running open-source Spark (i.e. no proprietary Amazon libraries) built on Hadoop 2.7 or newer, s3a:// is the way to go. s3n:// has been deprecated on the open-source side in favor of s3a://. You should only use s3n:// if you're running Spark on Hadoop 2.6 or older.
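If you go the s3a:// route outside of EMR, you typically also need to make your AWS credentials visible to Hadoop. The snippet below is only a hedged sketch: fs.s3a.access.key and fs.s3a.secret.key are standard Hadoop S3A configuration properties, but depending on your setup (IAM instance roles, environment variables, credential provider chains) you may not need to set them explicitly at all.

# Assumes an existing SparkContext `sc`, as in the examples above.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder value
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder value

rdd = sc.textFile("s3a://bucket/project1/20141201/logtype1/logtype1.*.gz")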