Spark RDD - Mapping with extra arguments

Question

Is it possible to pass extra arguments to the mapping function in PySpark? Specifically, I have the following code recipe:

    import json

    # sc is assumed to be an existing SparkContext
    raw_data_rdd = sc.textFile("data.json", use_unicode=True)
    json_data_rdd = raw_data_rdd.map(lambda line: json.loads(line))
    mapped_rdd = json_data_rdd.flatMap(processDataLine)

The function processDataLine takes extra arguments in addition to the JSON object, as:

    def processDataLine(dataline, arg1, arg2): ...

How can I pass the extra arguments arg1 and arg2 to the function used in flatMap?

Solution

  1. You can use an anonymous function, either directly in flatMap

    json_data_rdd.flatMap(lambda j: processDataLine(j, arg1, arg2))
    

    or assigned to a name first, which fixes arg1 and arg2 ahead of time

    f = lambda dataline: processDataLine(dataline, arg1, arg2)
    json_data_rdd.flatMap(f)
    

  2. You can turn processDataLine into a factory that takes arg1 and arg2 and returns the actual mapping function:

    def processDataLine(arg1, arg2):
        def _processDataLine(dataline):
            return ... # Do something with dataline, arg1, arg2
        return _processDataLine
    
    json_data_rdd.flatMap(processDataLine(arg1, arg2))
    

  3. The toolz library provides a useful curry decorator:

    from toolz.functoolz import curry
    
    @curry
    def processDataLine(arg1, arg2, dataline): 
        return ... # Do something with dataline, arg1, arg2
    
    json_data_rdd.flatMap(processDataLine(arg1, arg2))
    

    Note that I've moved the dataline argument to the last position. This is not required, but it means arg1 and arg2 can be supplied positionally, so we don't have to use keyword arguments.

  4. Finally, there is functools.partial (https://docs.python.org/2/library/functools.html#functools.partial), already mentioned by Avihoo Mamka in the comments.
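
The functools.partial option in item 4 comes without code, so here is a minimal sketch of how it might look. The body of processDataLine, the field names "id" and "name", and the flat_map helper are all assumptions made for illustration; flat_map is only a local stand-in for RDD.flatMap so the example runs without a SparkContext.

```python
from functools import partial
from itertools import chain


def processDataLine(dataline, arg1, arg2):
    # Hypothetical body: emit one (field, value) pair for each of the two
    # requested fields that is present in the record.
    return [(field, dataline[field]) for field in (arg1, arg2) if field in dataline]


def flat_map(func, records):
    # Local stand-in for RDD.flatMap: apply func to every record and
    # flatten the resulting lists into one list.
    return list(chain.from_iterable(func(record) for record in records))


# dataline is the first positional parameter here, so the extra arguments
# are bound by keyword. If dataline were last (as in item 3), a positional
# partial(processDataLine, arg1, arg2) would work as well.
bound = partial(processDataLine, arg1="id", arg2="name")

records = [{"id": 1, "name": "a"}, {"id": 2}]
result = flat_map(bound, records)
```

With a real RDD, the last line would presumably become json_data_rdd.flatMap(bound); the partial object takes a single dataline argument, which is what flatMap expects.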
