How to read a zip containing multiple files in Apache Spark

Views: 286 Published: 2016/5/22 16:34:43 Tags: scala apache-spark pyspark


Question


I have a zipped file containing multiple text files. I want to read each file and build a list of RDDs containing the content of each file.

val test = sc.textFile("/Volumes/work/data/kaggle/dato/test/5.zip")

This just reads the entire file, but how do I iterate through each entry of the zip and then save the contents in an RDD using Spark?

I am fine with Scala or Python.

Possible solution in Python using Spark:

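import zipfile

# archive_path is the path to the zip file (not defined in the original snippet)
archive = zipfile.ZipFile(archive_path, 'r')
file_paths = archive.namelist()  # names of all entries in the archive
for file_path in file_paths:
    urls = file_path.split("/")
    # the ID is the part of the file name before the first underscore
    urlId = urls[-1].split('_')[0]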

Solution

If you are reading binary files, use sc.binaryFiles. This returns an RDD of tuples containing the file name and a PortableDataStream. You can feed the latter into a ZipInputStream.
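A minimal sketch of that idea in Python, since the question accepts either language. Note the difference from the Scala/Java API described above: in PySpark, sc.binaryFiles yields (path, bytes) pairs rather than a PortableDataStream, so the standard zipfile module stands in for ZipInputStream here. The helper names zip_entries and read_zip are hypothetical, and the sketch assumes the text files are UTF-8 encoded.

```python
import io
import zipfile


def zip_entries(path_and_bytes):
    """Yield (entry_name, text) for every file inside one zip archive."""
    _path, content = path_and_bytes  # binaryFiles yields (path, bytes) pairs
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            # Assumption: the text files are UTF-8 encoded
            yield name, archive.read(name).decode("utf-8")


def read_zip(sc, zip_path):
    """Return an RDD of (entry_name, text) pairs; sc is an existing SparkContext."""
    return sc.binaryFiles(zip_path).flatMap(zip_entries)
```

Calling read_zip(sc, "/Volumes/work/data/kaggle/dato/test/5.zip") would give a single RDD with one record per file in the archive, rather than a list of RDDs. Each archive is read whole into memory by one task, so this approach suits archives that fit comfortably in executor memory.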
