Passing class functions to PySpark RDD
Problem description
I have a class named some_class() in a Python file here:
/some-folder/app/bin/file.py
I am importing it into my code here:
/some-folder2/app/code/file2.py
using:
import sys
sys.path.append('/some-folder/app/bin')
from file import some_class
clss = some_class()
I want to use this class's function named some_function in Spark's map:
sc.parallelize(some_data_iterator).map(lambda x: clss.some_function(x))
This gives me an error:
No module named file
clss.some_function works when I call it outside PySpark's map function, i.e. normally, but not inside a PySpark RDD operation. I think this has something to do with PySpark, but I have no idea where I am going wrong.
I tried broadcasting this class, but it still didn't work.
Recommended answer
All Python dependencies have to be either present on the search path of the worker nodes or distributed manually using the SparkContext.addPyFile method, so something like this should do the trick:
sc.addPyFile("/some-folder/app/bin/file.py")
It will copy the file to all the workers and place it in their working directory.
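To see why the error happens in the first place, note that Spark serializes the lambda with pickle, which records only the module and class name of clss; each worker must then be able to import that module itself. A minimal sketch of the same failure without Spark (the module name mymod and class SomeClass are stand-ins for the question's file.py and some_class):

```python
# Sketch: pickle stores a class by reference (module + name), so unpickling
# fails with "No module named ..." when the module isn't on the search path,
# exactly like a Spark worker that never received file.py.
import os
import pickle
import sys
import tempfile

tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "mymod.py"), "w") as f:  # stand-in for file.py
    f.write("class SomeClass:\n"
            "    def some_function(self, x):\n"
            "        return x * 2\n")

sys.path.append(tmp)
import mymod

payload = pickle.dumps(mymod.SomeClass())  # what Spark does to the closure

# Simulate a worker that cannot find the module on its search path.
sys.path.remove(tmp)
del sys.modules["mymod"]
try:
    pickle.loads(payload)
except ModuleNotFoundError as e:
    print(e)  # → No module named 'mymod'
finally:
    sys.path.append(tmp)  # restore the path; now unpickling works again
```

Restoring the path at the end is the moral equivalent of addPyFile: once the module is importable where the object is deserialized, the error disappears.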
On a side note, please don't use file as a module name, even if it is only an example. Shadowing built-ins in Python is not a good idea.