PySpark broadcast variables from local functions


Problem Description


I'm attempting to create broadcast variables from within Python methods (trying to abstract some utility methods I'm creating that rely on distributed operations). However, I can't seem to access the broadcast variables from within the Spark workers.

Let's say I have this setup:

def main():
    sc = SparkContext()
    SomeMethod(sc)

def SomeMethod(sc):
    someValue = rand()
    V = sc.broadcast(someValue)
    A = sc.parallelize().map(worker)

def worker(element):
    element *= V.value  ### NameError: global name 'V' is not defined ###

However, if I instead eliminate the SomeMethod() middleman, it works fine.

def main():
    sc = SparkContext()
    someValue = rand()
    V = sc.broadcast(someValue)
    A = sc.parallelize().map(worker)

def worker(element):
    element *= V.value   # works just fine

I'd rather not have to put all my Spark logic in the main method, if I can. Is there any way to broadcast variables from within local functions and have them be globally visible to the Spark workers?

Alternatively, what would be a good design pattern for this kind of situation? For example, I want to write a method specifically for Spark that is self-contained and performs a specific function I'd like to reuse.

Solution

I am not sure I completely understood the question, but if you need the V object inside the worker function, then you should definitely pass it as a parameter; otherwise the method is not really self-contained:

def worker(V, element):
    element *= V.value
    return element  # return the result so that map actually produces the transformed elements

Now, in order to use it in the map call, you need functools.partial, so that map only sees a one-parameter function:

from functools import partial

def SomeMethod(sc):
    someValue = rand()
    V = sc.broadcast(someValue)
    # Bind V as the first positional argument, so map only has to supply the element.
    A = sc.parallelize().map(partial(worker, V))
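
For reference, here is a minimal self-contained sketch of the same pattern that runs locally. The names and details added here (some_method, the local[*] master, the range(10) sample data, and random.random standing in for rand()) are illustrative assumptions, not part of the original post:

from functools import partial
from random import random  # stand-in for rand(); an assumption for this sketch

from pyspark import SparkContext

def worker(V, element):
    # V is the broadcast handle; .value reads the broadcast payload on the worker.
    return element * V.value

def some_method(sc):
    some_value = random()
    V = sc.broadcast(some_value)
    # Bind V positionally so that map only has to supply each element.
    rdd = sc.parallelize(range(10)).map(partial(worker, V))
    return rdd.collect()

if __name__ == "__main__":
    sc = SparkContext("local[*]", "broadcast-from-local-function")
    print(some_method(sc))
    sc.stop()

A closure works just as well if you prefer to avoid functools.partial, e.g. sc.parallelize(range(10)).map(lambda e: worker(V, e)); either way the broadcast handle travels to the workers as part of the serialized function instead of being looked up as a global.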
