Reading in multiple files compressed in tar.gz archive into Spark


Problem Description

I'm trying to create a Spark RDD from several JSON files compressed into a tar archive. For example, I have 3 files

file1.json
file2.json
file3.json

and these are all contained in archive.tar.gz.

I want to create a DataFrame from the JSON files. The problem is that Spark is not reading the JSON files correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.

Is there some way to handle gzipped archives containing multiple files in Spark?

Update

Using the method given in the answer to Read whole text files from a compression in Spark, I was able to get things running, but this method does not seem suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Since some of the archives I'm dealing with reach up to 2 GB after compression, I'm wondering if there is an efficient way to deal with the problem.

I'm trying to avoid extracting the archives and then merging the files together, as this would be time-consuming.

Recommended Answer

A solution is given in Read whole text files from a compression in Spark. Using the code sample provided there, I was able to create a DataFrame from the compressed archive like so:

// Read each archive as a (path, stream) pair, unpack the tar entries,
// and decode each file's bytes into a string.
val jsonRDD = sc.binaryFiles("gzarchive/*")
               .flatMapValues(x => extractFiles(x).toOption)
               .mapValues(_.map(decode()))

val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
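
The snippet above relies on the extractFiles and decode helpers defined in the linked answer. For completeness, a minimal sketch of what those helpers can look like, using Apache Commons Compress and assuming each archive's contents fit in executor memory:

import java.nio.charset.{Charset, StandardCharsets}
import scala.util.Try
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.commons.compress.utils.IOUtils
import org.apache.spark.input.PortableDataStream

// Unpack a gzipped tar delivered as a PortableDataStream into the raw
// bytes of each regular file entry. Wrapped in Try so unreadable
// archives can be dropped upstream via .toOption.
def extractFiles(ps: PortableDataStream): Try[Seq[Array[Byte]]] = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
  Iterator.continually(tar.getNextTarEntry)
    .takeWhile(_ != null)
    .filterNot(_.isDirectory)
    .map(_ => IOUtils.toByteArray(tar)) // reads to the end of the current entry
    .toList
}

// Curried so that decode() can be passed to map with the default charset.
def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]): String =
  new String(bytes, charset)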

This method works fine for tar archives of a relatively small size, but is not suitable for large archives.

A better solution to the problem seems to be to convert the tar archives to Hadoop SequenceFiles, which are splittable and hence can be read and processed in parallel in Spark (as opposed to tar archives).

See: A Million Little Files, a post on Stuart Sierra's blog Digital Digressions.
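
As an illustration, a one-time conversion pass and the subsequent parallel reads might look like the following sketch, assuming the jsonRDD from above and using a placeholder output path:

// One-time conversion: flatten each archive's decoded files into one
// string per archive and store the (path, text) pairs as a splittable
// SequenceFile. The "seqarchive/" path is a placeholder.
jsonRDD.mapValues(_.mkString("\n"))
       .saveAsSequenceFile("seqarchive/")

// Later jobs read the SequenceFile in parallel, splitting it across
// tasks, and build the DataFrame directly from the stored JSON text.
val jsonLines = sc.sequenceFile[String, String]("seqarchive/")
val df = sqlContext.read.json(jsonLines.values)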

