How to read multiple gzipped files from S3 into a single RDD?

Question

I have many gzipped files stored on S3, organized by project and by hour of the day; the file paths follow this pattern:

s3://<bucket>/project1/20141201/logtype1/logtype1.0000.gz
s3://<bucket>/project1/20141201/logtype1/logtype1.0100.gz
....
s3://<bucket>/project1/20141201/logtype1/logtype1.2300.gz

Since the data should be analyzed on a daily basis, I have to download and decompress the files belonging to a specific day, then assemble the content as a single RDD.

There are probably several ways to do this, but I would like to know the best practice for Spark.

Thanks in advance.

Answer

The underlying Hadoop API that Spark uses to access S3 allows you to specify input files using a glob expression.

From the Spark docs:

All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

So in your case you should be able to open all those files as a single RDD using something like this:

rdd = sc.textFile("s3://bucket/project1/20141201/logtype1/logtype1.*.gz")
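
For instance, a minimal PySpark sketch along those lines might look like the following; the bucket, project, and date values are placeholders for illustration:

from pyspark import SparkContext

sc = SparkContext(appName="daily-logs")

# Placeholder values; substitute your own bucket, project, and day.
bucket = "my-bucket"
project = "project1"
day = "20141201"

# One glob covers every hourly gzip file for that day; textFile
# decompresses the .gz files transparently while reading.
path = "s3://{}/{}/{}/logtype1/logtype1.*.gz".format(bucket, project, day)
daily_rdd = sc.textFile(path)

print(daily_rdd.count())  # total number of log lines for that day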

Just for the record, you can also specify files using a comma-delimited list, and you can even mix that with the * and ? wildcards.

For example:

rdd = sc.textFile("s3://bucket/201412??/*/*.gz,s3://bucket/random-file.txt")

Briefly, what this does is:

  • The * matches all strings, so in this case all gz files in all folders under 201412?? will be loaded.
  • The ? matches a single character, so 201412?? will cover all days in December 2014, like 20141201, 20141202, and so forth.
  • The , lets you load separate files at once into the same RDD, like the random-file.txt in this case; a small sketch of building such a list programmatically follows this list.
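
If the days you need can't be captured with a single wildcard, one option (sketched below with placeholder bucket and day values) is to build the comma-separated path list in code and hand it to textFile:

# Placeholder day list and bucket; adjust to your own layout.
days = ["20141201", "20141215", "20141231"]
paths = ",".join("s3://my-bucket/project1/{}/logtype1/*.gz".format(d) for d in days)

# textFile accepts the whole comma-separated string as a single argument.
rdd = sc.textFile(paths)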

Some notes about the appropriate URL scheme for S3 paths:

  • If you're running Spark on EMR, the correct URL scheme is s3://.
  • If you're running open-source Spark (i.e. no proprietary Amazon libraries) built on Hadoop 2.7 or newer, s3a:// is the way to go.
  • s3n:// has been deprecated on the open-source side in favor of s3a://. You should only use s3n:// if you're running Spark on Hadoop 2.6 or older.
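
As an illustration, with open-source Spark on Hadoop 2.7+ the s3a:// connector can be given credentials through Spark's spark.hadoop.* configuration passthrough; the sketch below uses the standard fs.s3a.access.key / fs.s3a.secret.key properties with placeholder values (IAM roles or environment variables are common alternatives):

from pyspark import SparkConf, SparkContext

# Placeholder credentials; prefer IAM roles or environment variables in practice.
conf = (SparkConf()
        .setAppName("s3a-example")
        .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
        .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY"))

sc = SparkContext(conf=conf)

# Note the s3a:// scheme instead of s3:// when not on EMR.
rdd = sc.textFile("s3a://my-bucket/project1/20141201/logtype1/*.gz")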
