ModuleNotFoundError because PySpark serializer is not able to locate library folder
Problem Description
I have the following folder structure:
- libfolder
- lib1.py
- lib2.py
- main.py
main.py calls libfolder.lib1.py, which then calls libfolder.lib2.py and others.
It all works perfectly fine on my local machine, but after I deploy it to Dataproc I get the following error:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'libfolder'
I have zipped the folder into xyz.zip and run the following command:
spark-submit --py-files=xyz.zip main.py
The serializer is not able to find the location for libfolder. Is there a problem with the way I am packaging my folders?
This issue is similar to this one but it's not answered.
Edit: in response to Igor's questions
unzip -l for the zip file returns the following:
- libfolder
- lib1.py
- lib2.py
- main.py
In main.py, lib1.py is called with this import statement:
from libfolder import lib1
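One quick way to check the archive layout outside of Spark (a sketch, not from the original post: it assumes the key requirement is that the package directory inside the zip contains an `__init__.py`, and it builds a throwaway zip with the question's `libfolder` names):

```python
import os
import sys
import tempfile
import zipfile

# Build a zip laid out the way --py-files expects: the package directory,
# including __init__.py, sits at the top level of the archive.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "xyz.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("libfolder/__init__.py", "")
    zf.writestr("libfolder/lib1.py", "def hello():\n    return 'hi from lib1'\n")

# Adding the zip to sys.path mimics what spark-submit does on each executor.
sys.path.insert(0, zip_path)
from libfolder import lib1

print(lib1.hello())  # -> hi from lib1
```

If this import fails locally with the same ModuleNotFoundError, the zip layout (rather than anything Dataproc-specific) is the problem.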
Answer
This worked for me:
$ cat main.py
from pyspark import SparkContext, SparkConf
from subpkg import sub

conf = SparkConf().setAppName("Shell Count")
sc = SparkContext(conf=conf)

text_file = sc.textFile("file:///etc/passwd")
counts = text_file.map(lambda line: sub.map(line)) \
                  .map(lambda shell: (shell, 1)) \
                  .reduceByKey(lambda a, b: sub.reduce(a, b))
counts.saveAsTextFile("hdfs:///count5.txt")

$ cat subpkg/sub.py
def map(line):
    return line.split(":")[6]

def reduce(a, b):
    return a + b
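As a side note, sub.map picks out field 6 of an /etc/passwd line, which is the login shell. A standalone check of that logic (the sample line is a hypothetical passwd entry):

```python
def shell_field(line):
    # Same split as subpkg/sub.py's map(): /etc/passwd fields are
    # colon-separated, and the login shell is field index 6.
    return line.split(":")[6]

print(shell_field("root:x:0:0:root:/root:/bin/bash"))  # -> /bin/bash
```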
$ unzip -l /tmp/deps.zip
Archive: /tmp/deps.zip
Length Date Time Name
--------- ---------- ----- ----
0 2019-01-07 14:22 subpkg/
0 2019-01-07 13:51 subpkg/__init__.py
79 2019-01-07 14:13 subpkg/sub.py
--------- -------
79 3 files
$ gcloud dataproc jobs submit pyspark --cluster test-cluster main.py --py-files deps.zip
Job [1f0f15108a4149c5942f49513ce04440] submitted.
Waiting for job output...
Hello world!
Job [1f0f15108a4149c5942f49513ce04440] finished successfully.