Processing bzipped JSON files in Spark?


Question


I have about 200 files in S3, e.g., a_file.json.bz2. Each line of these files is a record in JSON format, but some fields were serialised by pickle.dumps, e.g. a datetime field. Each file is about 1 GB after bzip compression. Now I need to process these files in Spark (pyspark, actually), but I couldn't even get each record out. So what would be the best practice here?

ds.take(10)

[(0, u'(I551'),
 (6, u'(dp0'),
 (11, u'Vadv_id'),
 (19, u'p1'),
 (22, u'V479883'),
 (30, u'p2'),
 (33, u'sVcpg_id'),
 (42, u'p3'),
 (45, u'V1913398'),
 (54, u'p4')]


Apparently the splitting is not by each record.
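(Once a complete line has been recovered, decoding a record is straightforward. The following is a minimal sketch, assuming the layout described above: one JSON object per line with the pickled field stored as a protocol-0 string. The field name created_at is hypothetical.)

def decode_record(line):
    """Minimal sketch: parse one complete line under the assumptions above.

    'created_at' is a made-up name for the pickle.dumps-serialised datetime
    field; pickle protocol 0 is plain ASCII, so it can be stored inside a
    JSON string and unpickled after encoding back to bytes.
    """
    import json
    import pickle

    record = json.loads(line)
    record["created_at"] = pickle.loads(record["created_at"].encode("ascii"))
    return record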

Thank you.

Answer


I had this issue reading gpg-encrypted files. You can use wholeTextFiles as Daniel suggests, but you have to be careful when reading large files, because the entire file is loaded into memory before processing. If the file is too large, it can crash the executor. I used parallelize and flatMap, maybe something along the lines of:

import bz2
import glob

def read_fun_generator(filename):
    # Stream one bz2-compressed file line by line, without loading it all into memory.
    with bz2.open(filename, 'rb') as f:
        for line in f:
            yield line.strip()

bz2_filelist = glob.glob("/path/to/files/*.bz2")
# Distribute the list of file names, then let each task decompress its own files.
rdd_from_bz2 = sc.parallelize(bz2_filelist).flatMap(read_fun_generator)
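A couple of usage notes on the sketch above: glob.glob only sees the driver's local filesystem, and the same paths must also be readable from the executors (e.g. a shared filesystem, or copies pulled down from S3 beforehand). The resulting RDD contains raw lines, so a parser such as the hypothetical decode_record shown earlier can then be mapped over it:

# Hypothetical continuation: turn the recovered lines into parsed records.
records = rdd_from_bz2.map(decode_record)
print(records.take(2))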

