PySpark broadcast variables from local functions
Problem description
I'm attempting to create broadcast variables from within Python methods (trying to abstract some utility methods I'm creating that rely on distributed operations). However, I can't seem to access the broadcast variables from within the Spark workers.
Let's say I have this setup:
def main():
    sc = SparkContext()
    SomeMethod(sc)

def SomeMethod(sc):
    someValue = rand()
    V = sc.broadcast(someValue)
    A = sc.parallelize().map(worker)

def worker(element):
    element *= V.value  ### NameError: global name 'V' is not defined ###
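The failure can be reproduced with plain Python scoping, no Spark required: a module-level function cannot see another function's locals, so `V` is looked up (and not found) in the global namespace. A minimal sketch, with hypothetical names standing in for the Spark code:

```python
def some_method():
    V = 42            # local to some_method, like the broadcast variable above
    return worker(7)  # worker runs, but V is not in its scope

def worker(element):
    return element * V  # V resolves to the global scope, where it does not exist

try:
    some_method()
    raised = False
except NameError:
    raised = True

print(raised)  # True
```

This is exactly why moving the assignment into main's module scope (as in the next snippet) makes the lookup succeed.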
However, if I instead eliminate the SomeMethod()
middleman, it works fine.
def main():
    sc = SparkContext()
    someValue = rand()
    V = sc.broadcast(someValue)
    A = sc.parallelize().map(worker)

def worker(element):
    element *= V.value  # works just fine
I'd rather not have to put all my Spark logic in the main method, if I can. Is there any way to broadcast variables from within local functions and have them be globally visible to the Spark workers?
Alternatively, what would be a good design pattern for this kind of situation--e.g., I want to write a method specifically for Spark which is self-contained and performs a specific function I'd like to re-use?
I am not sure I completely understood the question, but if you need the V
object inside the worker function, then you should definitely pass it as a parameter; otherwise the method is not really self-contained:
def worker(V, element):
    element *= V.value
Now, in order to use it in map, you need a partial, so that map only sees a one-parameter function:
from functools import partial

def SomeMethod(sc):
    someValue = rand()
    V = sc.broadcast(someValue)
    # Bind V positionally: partial(worker, V=V) would raise a TypeError,
    # since the element map passes in would also land on the V parameter.
    A = sc.parallelize().map(partial(worker, V))
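Since the snippet above needs a live SparkContext, here is a Spark-free sketch of the same partial-binding pattern. `FakeBroadcast` is a hypothetical stand-in that only mimics the `.value` attribute of a real `pyspark.Broadcast` object, and Python's built-in `map` plays the role of `RDD.map`:

```python
from functools import partial

class FakeBroadcast:
    """Hypothetical stand-in for pyspark.Broadcast; workers only read .value."""
    def __init__(self, value):
        self.value = value

def worker(V, element):
    return element * V.value

V = FakeBroadcast(3)

# partial(worker, V) freezes the first argument, so map sees a
# one-parameter function, just as RDD.map requires.
result = list(map(partial(worker, V), [1, 2, 4]))
print(result)  # [3, 6, 12]
```

The same binding works unchanged when `map` is an RDD's map and `V` is a real broadcast variable, because only `V.value` is ever touched inside `worker`.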