Reading in multiple files compressed in tar.gz archive into Spark

Problem Description

I'm trying to create a Spark RDD from several JSON files compressed into a tar archive. For example, I have three files:

file1.json
file2.json
file3.json

These are contained in archive.tar.gz.

I want to create a DataFrame from the JSON files. The problem is that Spark does not read the JSON files correctly: creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.

Is there some way to handle gzipped archives containing multiple files in Spark?

Update

Using the method given in the answer to Read whole text files from a compression in Spark, I was able to get things running, but this approach does not seem suitable for large tar.gz archives (>200 MB compressed), as the application chokes on them. Since some of the archives I'm dealing with reach sizes of up to 2 GB after compression, I'm wondering whether there is some efficient way to deal with the problem.

I'm trying to avoid extracting the archives and then merging the files together, as this would be time-consuming.

Recommended Answer

A solution is given in Read whole text files from a compression in Spark. Using the code sample provided, I was able to create a DataFrame from the compressed archive like so:

// Read each archive as a (path, PortableDataStream) pair, unpack the tar
// entries into byte arrays, and decode each file's bytes to a String.
val jsonRDD = sc.binaryFiles("gzarchive/*")
                .flatMapValues(x => extractFiles(x).toOption)
                .mapValues(_.map(decode()))

// Flatten the per-archive arrays of JSON strings and parse into a DataFrame.
val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
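
The extractFiles and decode helpers used above are not defined in this post; they come from the linked answer. A minimal sketch of what they look like, assuming Apache Commons Compress is on the classpath (illustrative, not necessarily the exact original code):

import java.nio.charset.{Charset, StandardCharsets}
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.spark.input.PortableDataStream
import scala.util.Try

// Unpack a gzipped tar stream into an array of per-file byte arrays.
def extractFiles(ps: PortableDataStream, n: Int = 1024): Try[Array[Array[Byte]]] = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
  Stream.continually(Option(tar.getNextTarEntry))
    .takeWhile(_.isDefined)      // stop once getNextTarEntry returns null
    .flatMap(x => x)
    .filter(!_.isDirectory)      // skip directory entries
    .map { _ =>                  // read the current entry's bytes in n-byte chunks
      Stream.continually {
        val buffer = Array.fill[Byte](n)(-1)
        val read = tar.read(buffer, 0, n)
        (read, buffer.take(read))
      }.takeWhile(_._1 > 0)      // keep reading until EOF for this entry
        .flatMap(_._2)
        .toArray
    }
    .toArray
}

// Turn raw file bytes into a String, defaulting to UTF-8.
def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]): String =
  new String(bytes, charset)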

This method works fine for tar archives of a relatively small size, but is not suitable for large archive sizes.

A better solution to the problem seems to be to convert the tar archives to Hadoop SequenceFiles, which are splittable and can therefore be read and processed in parallel in Spark (unlike tar archives).

See: stuartsierra.com/2008/04/24/a-million-little-files
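
For illustration, once the archives have been repacked as SequenceFiles of (path, bytes) records (for example with the tar-to-seq tool described in the linked post), reading them back in parallel could look like the sketch below; the "seqarchive" path and the Text/BytesWritable key/value types are assumptions about how the files were written:

import org.apache.hadoop.io.{BytesWritable, Text}

// SequenceFiles are splittable, so each block can be read by a separate task.
val seqRDD = sc.sequenceFile("seqarchive/*", classOf[Text], classOf[BytesWritable])

// Decode each file's bytes and parse the JSON strings into a DataFrame.
val jsonStrings = seqRDD.map { case (_, bytes) => new String(bytes.copyBytes(), "UTF-8") }
val df = sqlContext.read.json(jsonStrings)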
