Why are my `binaryFiles` empty when I collect them in pyspark?

Problem Description

I have two zip files on HDFS in the same folder: /user/path-to-folder-with-zips/.

I pass that to `binaryFiles` in pyspark:

zips = sc.binaryFiles('/user/path-to-folder-with-zips/')

I'm trying to unzip the zip files and work with the text files inside them, so I first wanted to see what the contents look like when I collect the RDD. I did it like this:

zips_collected = zips.collect()

But, when I do that, it gives an empty list:

>>> zips_collected
[]

I know that the zips are not empty - they contain text files. The documentation for `binaryFiles` says:

Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

What am I doing wrong here? I know I can't view the contents of the file because it is zipped and therefore binary. But I should at least be able to see SOMETHING. Why does it not return anything?
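
Based on that documentation, here is roughly what I expected `collect()` to return - a list of (path, content) pairs with the raw zip bytes as values (the file names below are just illustrative):

>>> zips_collected
[('hdfs:///user/path-to-folder-with-zips/file1.zip', b'PK\x03\x04...'),
 ('hdfs:///user/path-to-folder-with-zips/file2.zip', b'PK\x03\x04...')]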

There can be more than one file per zip file, but the contents are always something like this:

rownum|data|data|data|data|data
rownum|data|data|data|data|data
rownum|data|data|data|data|data

Solution

I'm assuming that each zip file contains a single text file (the code is easily changed to handle multiple text files). You need to read the contents of the compressed file into memory via io.BytesIO before processing it line by line. The solution is loosely based on https://stackoverflow.com/a/36511190/234233.

import io
import gzip

def zip_extract(x):
    """Extract a gzip-compressed file in memory for Spark.

    x is a (path, bytes) pair produced by binaryFiles; x[1] is the raw content.
    """
    file_obj = gzip.GzipFile(fileobj=io.BytesIO(x[1]), mode="rb")
    # Decode to str so the content can be split on "\n" below (Python 3)
    return file_obj.read().decode("utf-8")

zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')
# parse_line is your own function for handling the pipe-delimited rows
results = zip_data.map(zip_extract) \
                  .flatMap(lambda zip_file: zip_file.split("\n")) \
                  .map(lambda line: parse_line(line)) \
                  .collect()
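
Note that gzip.GzipFile only understands gzip streams, so the function above is really for *.gz input. For true .zip archives you would swap in the zipfile module instead. A minimal sketch under the same one-text-file-per-archive assumption (zip_extract_zip and the parse_line below are illustrative, not part of the original answer):

import io
import zipfile

def zip_extract_zip(x):
    """Extract the first file of a .zip archive in memory for Spark."""
    with zipfile.ZipFile(io.BytesIO(x[1])) as zf:
        name = zf.namelist()[0]  # assuming a single text file per archive
        return zf.read(name).decode("utf-8")

def parse_line(line):
    """Hypothetical parser for the rownum|data|data|... rows shown above."""
    return line.split("|")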
