How to read multiple gzipped files from S3 into a single RDD?
Question
I have many gzipped files stored on S3, organized by project and by hour per day. The file paths follow this pattern:
s3://<bucket>/project1/20141201/logtype1/logtype1.0000.gz
s3://<bucket>/project1/20141201/logtype1/logtype1.0100.gz
....
s3://<bucket>/project1/20141201/logtype1/logtype1.2300.gz
Since the data should be analyzed on a daily basis, I have to download and decompress the files belonging to a specific day, then assemble the contents into a single RDD.
There are probably several ways to do this, but I would like to know the best practice for Spark.
Thanks.
Answer
The underlying Hadoop API that Spark uses to access S3 allows you to specify input files using a glob expression.
From the Spark documentation:
All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
So in your case you should be able to open all those files as a single RDD using something like this:
rdd = sc.textFile("s3://bucket/project1/20141201/logtype1/logtype1.*.gz")
Just for the record, you can also specify files using a comma-delimited list, and you can even mix that with the * and ? wildcards.
For example:
rdd = sc.textFile("s3://bucket/201412??/*/*.gz,s3://bucket/random-file.txt")
Briefly, what this does is:
- The * matches all strings, so in this case all gz files in all folders under 201412?? will be loaded.
- The ? matches a single character, so 201412?? will cover all days in December 2014 like 20141201, 20141202, and so forth.
- The , lets you load separate files at once into the same RDD, like the random-file.txt in this case.
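Since the files are grouped per day, you will usually want to build that glob from a date rather than hard-code it. The helper below is a hypothetical sketch of my own (day_glob is not a Spark API); it assumes an existing SparkContext sc, as in the snippets above, and reproduces the path layout from the question.

from datetime import date

def day_glob(bucket, project, logtype, day):
    # Build the glob covering all hourly .gz files for one day.
    return "s3://{0}/{1}/{2}/{3}/{3}.*.gz".format(
        bucket, project, day.strftime("%Y%m%d"), logtype)

rdd = sc.textFile(day_glob("bucket", "project1", "logtype1", date(2014, 12, 1)))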
Some notes about the appropriate URL scheme for S3 paths:
- If you're running Spark on EMR, the correct URL scheme is s3://.
- If you're running open-source Spark (i.e. no proprietary Amazon libraries) built on Hadoop 2.7 or newer, s3a:// is the way to go. s3n:// has been deprecated on the open-source side in favor of s3a://. You should only use s3n:// if you're running Spark on Hadoop 2.6 or older.
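If you go the s3a:// route outside of EMR, you typically also need to make your AWS credentials visible to Hadoop. The snippet below is only a hedged sketch: fs.s3a.access.key and fs.s3a.secret.key are standard Hadoop S3A configuration properties, but depending on your setup (IAM instance roles, environment variables, credential provider chains) you may not need to set them explicitly at all.

# Assumes an existing SparkContext `sc`, as in the examples above.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder value
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder value

rdd = sc.textFile("s3a://bucket/project1/20141201/logtype1/logtype1.*.gz")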