Reading in multiple files compressed in a tar.gz archive into Spark
Question
I'm trying to create a Spark RDD from several JSON files compressed into a tar archive. For example, I have 3 files
file1.json
file2.json
file3.json
and these are contained in archive.tar.gz.
I want to create a DataFrame from the JSON files. The problem is that Spark is not reading in the JSON files correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.
Is there some way to handle gzipped archives containing multiple files in Spark?
Update
Using the method given in the answer to Read whole text files from a compression in Spark, I was able to get things running, but this method does not seem to be suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Since some of the archives I'm dealing with reach sizes of up to 2 GB after compression, I'm wondering if there is an efficient way to deal with the problem.
I'm trying to avoid extracting the archives and then merging the files together, as this would be time-consuming.
Answer
A solution is given in Read whole text files from a compression in Spark. Using the code sample provided there, I was able to create a DataFrame from the compressed archive like so:
val jsonRDD = sc.binaryFiles("gzarchive/*")
  .flatMapValues(x => extractFiles(x).toOption)
  .mapValues(_.map(decode()))

val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
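The snippet above relies on extractFiles and decode helpers from the linked answer, which use Apache commons-compress to walk the tar entries. As a rough illustration of what such an extractor does, here is a minimal, dependency-free sketch that un-gzips the archive bytes and parses the 512-byte tar headers using only the JDK; the object name, the (name, bytes) return shape, and the header-parsing details are my assumptions, not the linked answer's exact code:

```scala
import java.io.{ByteArrayInputStream, DataInputStream}
import java.util.zip.GZIPInputStream
import scala.collection.mutable.ArrayBuffer

// Minimal sketch of a tar.gz extractor using only the JDK (the linked
// answer uses Apache commons-compress instead). Returns (name, bytes)
// for each regular-file entry in the archive.
object TarGzExtract {
  def extractFiles(raw: Array[Byte]): Seq[(String, Array[Byte])] = {
    val in  = new DataInputStream(new GZIPInputStream(new ByteArrayInputStream(raw)))
    val out = ArrayBuffer.empty[(String, Array[Byte])]
    val header = new Array[Byte](512)          // tar uses 512-byte blocks
    var done = false
    while (!done) {
      in.readFully(header)
      if (header.forall(_ == 0)) done = true   // all-zero block ends the archive
      else {
        // Entry name: bytes 0-99, NUL-terminated; size: bytes 124-135, octal ASCII.
        val name = new String(header.take(100).takeWhile(_ != 0), "US-ASCII")
        val size = java.lang.Long.parseLong(
          new String(header.slice(124, 136), "US-ASCII").trim, 8).toInt
        val data = new Array[Byte](size)
        in.readFully(data)
        in.readFully(new Array[Byte]((512 - size % 512) % 512)) // skip padding
        val flag = header(156)
        if (flag == '0' || flag == 0) out += ((name, data))     // regular files only
      }
    }
    in.close()
    out.toSeq
  }
}
```

In the Spark job, each value produced by sc.binaryFiles is a PortableDataStream, so a helper like this would be fed its bytes (e.g. via toArray()); the linked answer additionally wraps the extraction in a Try, which is why extractFiles(x).toOption can silently drop unreadable archives.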
This method works fine for tar archives of relatively small size, but is not suitable for large archives.
A better solution seems to be to convert the tar archives to Hadoop SequenceFiles, which are splittable and hence can be read and processed in parallel in Spark (as opposed to tar archives).
See: stuartsierra.com/2008/04/24/a-million-little-files