Zip support in Apache Spark

Problem description

I have read about Spark's support for gzip-compressed input files here, and I wonder whether the same support exists for other kinds of compressed files, such as .zip files. So far I have tried computing over a file compressed inside a zip archive, but Spark does not seem able to read its contents successfully.

I have taken a look at newAPIHadoopFile and newAPIHadoopRDD (Spark's entry points to the new Hadoop InputFormat API), but so far I have not been able to get anything working.

In addition, Spark supports creating a partition for every file under a specified folder, like in the example below:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf SpkCnf = new SparkConf().setAppName("SparkApp")
                                  .setMaster("local[4]");

JavaSparkContext Ctx = new JavaSparkContext(SpkCnf);

// Backslashes must be escaped in Java string literals
JavaRDD<String> FirstRDD = Ctx.textFile("C:\\input\\").cache();

Where C:\input\ points to a directory with multiple files.

If computing over zipped files is possible, would it also be possible to pack every file into a single compressed archive and keep the same pattern of one partition per file?

Solution

Since Apache Spark uses Hadoop InputFormats, we can look at the Hadoop documentation on how to process zip files and see whether something there works.

This site gives an idea of how to do this (namely, with a ZipFileInputFormat). That being said, since zip files are not splittable (see this), the request to have a single compressed file isn't really well supported. Instead, if possible, it would be better to have a directory containing many separate zip files.
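To make that concrete, here is a minimal sketch of wiring such a ZipFileInputFormat into Spark through newAPIHadoopFile. The package shown (com.cotdp.hadoop, one commonly used third-party implementation) and the Text/BytesWritable key/value types are assumptions about that kind of library, not part of Spark or core Hadoop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import com.cotdp.hadoop.ZipFileInputFormat;  // assumed third-party InputFormat, not core Hadoop

public class ZipReadSketch {
    public static JavaPairRDD<Text, BytesWritable> readZip(JavaSparkContext ctx, String path) {
        // Hand the zip-aware InputFormat to Spark through the new Hadoop API entry point.
        // Each record is one entry of the archive: key = entry name, value = entry bytes (assumed).
        return ctx.newAPIHadoopFile(
                path,                      // e.g. a single .zip file under C:\input\
                ZipFileInputFormat.class,
                Text.class,
                BytesWritable.class,
                new Configuration());
    }
}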

This question is similar to this other question; however, it adds the further question of whether it would be possible to use a single zip file (which, since zip is not a splittable format, is not a good idea).
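For the "directory containing many separate zip files" route, one workaround (not from the linked answer, just a common pattern) is to read each archive whole with binaryFiles and unpack it on the executors with java.util.zip; this gives roughly one task per archive, at the cost of never splitting an individual zip. The sketch below assumes the Spark 2.x Java API, where flatMap expects an Iterator.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipDirectorySketch {
    public static JavaRDD<String> linesFromZips(JavaSparkContext ctx, String dir) {
        // One (fileName, stream) pair per zip archive under the given directory
        JavaPairRDD<String, PortableDataStream> archives = ctx.binaryFiles(dir);

        return archives.flatMap(archive -> {
            List<String> lines = new ArrayList<>();
            try (ZipInputStream zis = new ZipInputStream(archive._2().open())) {
                ZipEntry entry;
                while ((entry = zis.getNextEntry()) != null) {
                    // Read each (text) entry line by line; the reader is not closed here,
                    // because closing it would also close the underlying ZipInputStream.
                    BufferedReader reader = new BufferedReader(new InputStreamReader(zis));
                    String line;
                    while ((line = reader.readLine()) != null) {
                        lines.add(line);
                    }
                }
            }
            return lines.iterator();  // Spark 2.x flatMap expects an Iterator
        });
    }
}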
