Pickling a Spark RDD and reading it into Python


Problem description

I am trying to serialize a Spark RDD by pickling it, and read the pickled file directly into Python.

a = sc.parallelize(['1','2','3','4','5'])
a.saveAsPickleFile('test_pkl')
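
Note that saveAsPickleFile does not write a single pickle file: 'test_pkl' is a directory of part files in Hadoop SequenceFile format. Within Spark itself the data reads back directly; a minimal sketch, assuming the same SparkContext sc:

b = sc.pickleFile('test_pkl')
print(b.collect())  # ['1', '2', '3', '4', '5']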

I then copy the test_pkl files to my local machine. How can I read them directly into Python? When I try the standard pickle module, it fails on the first part file of 'test_pkl':

pickle.load(open('part-00000','rb'))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/pickle.py", line 1370, in load
    return Unpickler(file).load()
  File "/usr/lib64/python2.6/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib64/python2.6/pickle.py", line 970, in load_string
    raise ValueError, "insecure string pickle"
ValueError: insecure string pickle
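
Peeking at the first bytes of the part file hints at what is going on; a quick check (a sketch, assuming 'part-00000' was copied as above):

with open('part-00000', 'rb') as f:
    print(f.read(4))  # b'SEQ\x06' -- the Hadoop SequenceFile magic, not pickle data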

I assume that the pickling method Spark uses is different from Python's pickle method (correct me if I am wrong). Is there any way for me to pickle data from Spark and read the pickled objects directly into Python from the files?

Answer

It is possible using the sparkpickle project. It is as simple as:

with open("/path/to/file", "rb") as f:
    print(sparkpickle.load(f))
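
This works because the part files are Hadoop SequenceFiles whose values hold batches of pickled objects, so they are not plain pickle streams: pickle.load first hits the container's magic bytes b'SEQ', where 'S' happens to be the protocol-0 STRING opcode, and since the bytes that follow are not a quoted string, Python 2 raises the "insecure string pickle" error seen above. sparkpickle parses the SequenceFile container and unpickles the values for you. A sketch that reads every part file from the copied directory, assuming sparkpickle is installed (e.g. pip install sparkpickle):

import glob
import sparkpickle

values = []
for path in sorted(glob.glob('test_pkl/part-*')):
    with open(path, 'rb') as f:
        # sparkpickle.load returns the objects stored in one part file
        values.extend(sparkpickle.load(f))
print(values)  # ['1', '2', '3', '4', '5']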
