Passing class functions to PySpark RDD


Problem description

I have a class named some_class() in a Python file here:

/some-folder/app/bin/file.py

I am importing it into my code here:

/some-folder2/app/code/file2.py

using:

import sys
sys.path.append('/some-folder/app/bin')
from file import some_class

clss = some_class()

I want to use this class's function, some_function, inside Spark's map:

sc.parallelize(some_data_iterator).map(lambda x: clss.some_function(x))

This gives me an error:

No module named file

clss.some_function works when I call it outside pyspark's map function, i.e. normally, but not inside pySpark's RDD. I think this has something to do with pyspark. I have no idea where I am going wrong here.

I tried broadcasting this class and it still didn't work.

Recommended answer

All Python dependencies have to be either present on the search path of the worker nodes or distributed manually using the SparkContext.addPyFile method, so something like this should do the trick:

sc.addPyFile("/some-folder/app/bin/file.py")

It will copy the file to all the workers and place it in their working directory.
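
For context, here is a minimal end-to-end sketch of how the pieces fit together. It assumes a SparkContext named sc, the paths from the question, and that file.py actually defines some_class with a some_function method; the sample data is only a placeholder:

import sys
from pyspark import SparkContext

sc = SparkContext(appName="class-function-example")

# Make the module importable on the driver (as in the question) ...
sys.path.append('/some-folder/app/bin')
# ... and ship it to every worker so the executors can import it when
# they unpickle the lambda that references clss.
sc.addPyFile("/some-folder/app/bin/file.py")

from file import some_class

clss = some_class()

some_data_iterator = range(10)  # placeholder data, only for illustration
result = sc.parallelize(some_data_iterator) \
           .map(lambda x: clss.some_function(x)) \
           .collect()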

On a side note, please don't use file as a module name, even if it is only an example. Shadowing built-ins in Python is not a good idea.
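
To illustrate the shadowing (a tiny hypothetical sketch; file is a built-in type only in Python 2):

f = file("data.txt")   # Python 2: calls the built-in file type, similar to open()

import file            # the name file now refers to your module instead ...
f = file("data.txt")   # ... so this raises TypeError: 'module' object is not callable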
