Spark RDD - Mapping with extra arguments
Problem description
Is it possible to pass extra arguments to the mapping function in PySpark? Specifically, I have the following code:
import json

raw_data_rdd = sc.textFile("data.json", use_unicode=True)
json_data_rdd = raw_data_rdd.map(lambda line: json.loads(line))
mapped_rdd = json_data_rdd.flatMap(processDataLine)
The function processDataLine takes extra arguments in addition to the JSON object:
def processDataLine(dataline, arg1, arg2)
How can I pass the extra arguments arg1 and arg2 to the flatMap function?
You can use an anonymous function, either directly in flatMap:
json_data_rdd.flatMap(lambda j: processDataLine(j, arg1, arg2))
or to curry processDataLine:
f = lambda j: processDataLine(j, arg1, arg2)
json_data_rdd.flatMap(f)
You can generate processDataLine like this:
def processDataLine(arg1, arg2):
    def _processDataLine(dataline):
        return ...  # Do something with dataline, arg1, arg2
    return _processDataLine

json_data_rdd.flatMap(processDataLine(arg1, arg2))
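To make the factory pattern above concrete, here is a minimal sketch with a made-up body. The name make_processor, the parameters field and sep, and the sample records are illustrative assumptions, not part of the original question. itertools.chain.from_iterable stands in for flatMap so the snippet runs without a Spark context.

```python
from itertools import chain

def make_processor(field, sep):
    """Return a one-argument function of the shape flatMap expects.

    `field` and `sep` play the role of arg1 and arg2 (hypothetical meaning).
    """
    def _process(dataline):
        # Split the chosen field of one parsed JSON record into tokens.
        return dataline.get(field, "").split(sep)
    return _process

# Local stand-in for json_data_rdd: a plain list of parsed JSON objects.
records = [{"tags": "a,b"}, {"tags": "c"}]

process = make_processor("tags", ",")
# chain.from_iterable mimics flatMap: map each record, then flatten.
flattened = list(chain.from_iterable(map(process, records)))
print(flattened)
```

With a real RDD, the equivalent call would presumably be json_data_rdd.flatMap(make_processor("tags", ",")).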
The toolz library provides a useful curry decorator:
from toolz.functoolz import curry

@curry
def processDataLine(arg1, arg2, dataline):
    return ...  # Do something with dataline, arg1, arg2

json_data_rdd.flatMap(processDataLine(arg1, arg2))
Note that I've pushed the dataline argument to the last position. It is not required, but this way we don't have to use keyword args.
Finally, there is functools.partial (https://docs.python.org/2/library/functools.html#functools.partial), already mentioned by Avihoo Mamka in the comments.
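Since functools.partial is mentioned without an example, here is a rough sketch of how it could be applied to the original signature, where dataline comes first. The function body and the sample records are invented for illustration, and chain.from_iterable again stands in for flatMap so the snippet runs locally.

```python
from functools import partial
from itertools import chain

def processDataLine(dataline, arg1, arg2):
    # Hypothetical body: scale each value by arg1, then offset it by arg2.
    return [value * arg1 + arg2 for value in dataline["values"]]

# Bind the extra arguments by keyword so dataline remains the only
# free (first positional) parameter.
process = partial(processDataLine, arg1=10, arg2=1)

# Local stand-in for json_data_rdd.
records = [{"values": [1, 2]}, {"values": [3]}]
result = list(chain.from_iterable(map(process, records)))
print(result)
```

Keyword binding is needed here only because dataline is the first parameter. If the extra arguments came first, as in the curry example, partial(processDataLine, arg1, arg2) would work positionally.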