Pickling a Spark RDD and reading it into Python
Problem description
I am trying to serialize a Spark RDD by pickling it, and then read the pickled file directly into Python.
a = sc.parallelize(['1','2','3','4','5'])
a.saveAsPickleFile('test_pkl')
I then copy the test_pkl files to my local machine. How can I read them directly into Python? When I try the normal pickle package, it fails as soon as I attempt to read the first pickle part of 'test_pkl':
pickle.load(open('part-00000','rb'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/pickle.py", line 1370, in load
return Unpickler(file).load()
File "/usr/lib64/python2.6/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib64/python2.6/pickle.py", line 970, in load_string
raise ValueError, "insecure string pickle"
ValueError: insecure string pickle
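This failure is expected: saveAsPickleFile does not write a raw pickle stream. It stores the pickled objects inside a Hadoop SequenceFile, so each part file begins with the SequenceFile magic bytes b'SEQ' rather than a pickle opcode. A minimal simulation (using a stand-in file with a fabricated header, since the real part-00000 is not available here) reproduces the kind of failure shown above:

```python
import pickle
import tempfile

# Stand-in for a Spark part file: Hadoop SequenceFiles begin with the
# magic bytes b'SEQ' plus a version byte, not with pickle opcodes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'SEQ\x06!org.apache.hadoop.io.BytesWritable')
    path = f.name

# The leading b'S' is interpreted as pickle's STRING opcode, which
# expects a quoted string. That mismatch is what produced the
# "insecure string pickle" ValueError in Python 2; Python 3 raises
# a pickle.UnpicklingError (or similar) for the same header.
try:
    with open(path, 'rb') as f:
        pickle.load(f)
    error = None
except Exception as exc:
    error = exc

print(type(error).__name__)
```

In other words, the bytes on disk are a SequenceFile container around pickled records, which the plain pickle module does not know how to unwrap.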
I assume that the pickling method Spark is using is different from the Python pickle method (correct me if I am wrong). Is there any way for me to pickle data from Spark and read this pickled object directly into Python from the file?
Answer
This is possible using the sparkpickle project. It is as simple as:
with open("/path/to/file", "rb") as f:
print(sparkpickle.load(f))
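If you control the Spark job and the dataset fits in driver memory, an alternative worth noting (a sketch, not part of the original answer) is to collect the RDD on the driver and write an ordinary pickle file with the standard library, which plain pickle.load can then read back without any extra package:

```python
import pickle
import tempfile

# Stand-in for rdd.collect(); in a real job this would be
# data = sc.parallelize(['1', '2', '3', '4', '5']).collect()
data = ['1', '2', '3', '4', '5']

with tempfile.NamedTemporaryFile(suffix='.pkl', delete=False) as f:
    pickle.dump(data, f)  # a raw pickle stream this time, no container
    path = f.name

with open(path, 'rb') as f:
    restored = pickle.load(f)  # works: no SequenceFile wrapper
```

The trade-off is that collect() pulls everything to the driver, so this only suits data small enough to fit on one machine; for large RDDs, sparkpickle (or reading the files back with sc.pickleFile inside Spark) is the better route.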