ModuleNotFoundError because PySpark serializer is not able to locate library folder
Question
I have the following folder structure:
- libfolder
    - lib1.py
    - lib2.py
- main.py
main.py calls libfolder.lib1.py, which then calls libfolder.lib2.py and others.
It all works perfectly fine on my local machine, but after I deploy it to Dataproc I get the following error:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'libfolder'
I zipped the folder into xyz.zip and ran the following command:
spark-submit --py-files=xyz.zip main.py
The serializer is not able to find the location of libfolder. Is there a problem with the way I am packaging my folders?
This issue is similar to this one, but that question was never answered.
In response to Igor's question: unzip -l for the zip file returns the following:
- libfolder
    - lib1.py
    - lib2.py
- main.py
In main.py, lib1.py is called with this import statement:
from libfolder import lib1
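
For context, --py-files ships the archive to the driver and executors and adds it to sys.path, so this import only works if libfolder/ sits at the root of the zip. The same effect can be had from inside the job with SparkContext.addPyFile; a minimal sketch, not part of the original question:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
sc.addPyFile("xyz.zip")     # distributes the zip and puts it on sys.path on every executor

from libfolder import lib1  # import only after the zip has been shipped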
Answer
This is what works for me:
$ cat main.py
from pyspark import SparkContext, SparkConf
from subpkg import sub

conf = SparkConf().setAppName("Shell Count")
sc = SparkContext(conf = conf)

# count login shells in /etc/passwd using helpers from the zipped subpkg package
text_file = sc.textFile("file:///etc/passwd")
counts = text_file.map(lambda line: sub.map(line)) \
    .map(lambda shell: (shell, 1)) \
    .reduceByKey(lambda a, b: sub.reduce(a, b))

counts.saveAsTextFile("hdfs:///count5.txt")
$ cat subpkg/sub.py
def map(line):
    return line.split(":")[6]

def reduce(a, b):
    return a + b
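
For reference, sub.map pulls field 7 (the login shell) out of an /etc/passwd line, and sub.reduce is plain addition; a quick local check with a hypothetical input line, not part of the original answer:

from subpkg import sub

line = "root:x:0:0:root:/root:/bin/bash"  # hypothetical /etc/passwd entry
assert sub.map(line) == "/bin/bash"       # field 7 is the login shell
assert sub.reduce(1, 2) == 3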
$ unzip -l /tmp/deps.zip
Archive:  /tmp/deps.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2019-01-07 14:22   subpkg/
        0  2019-01-07 13:51   subpkg/__init__.py
       79  2019-01-07 14:13   subpkg/sub.py
---------                     -------
       79                     3 files
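
Note the key difference from xyz.zip: the archive contains subpkg/__init__.py, and subpkg/ sits at the root of the zip. The answer doesn't show how deps.zip was built, but commands along these lines would produce that layout (hypothetical, run from the package's parent directory):

$ touch subpkg/__init__.py      # package marker so Python can import subpkg
$ zip -r /tmp/deps.zip subpkg   # archives the directory itself, not its contents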
$ gcloud dataproc jobs submit pyspark --cluster test-cluster main.py --py-files deps.zip
Job [1f0f15108a4149c5942f49513ce04440] submitted.
Waiting for job output...
Hello world!
Job [1f0f15108a4149c5942f49513ce04440] finished successfully.
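
Applied to the layout in the question, the equivalent fix would presumably be to add an __init__.py to libfolder and build the zip from the project root, so that libfolder/ ends up at the top of the archive; a hedged sketch, not from the original answer:

$ touch libfolder/__init__.py
$ zip -r xyz.zip libfolder
$ unzip -l xyz.zip   # should now list libfolder/, libfolder/__init__.py, libfolder/lib1.py, ...
$ spark-submit --py-files=xyz.zip main.py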