How to specifically determine input for each map step in MRJob?


Problem description

I am working on a map-reduce job consisting of multiple steps. With mrjob, every step receives the previous step's output. The problem is that I don't want it to.

What I want is to extract some information in the first step and use it in the second step against all of the input, and so on. Is it possible to do this using mrjob?

Note: Since I don't want to use EMR, this question is not much help to me.

UPDATE: If it is not possible to do this in a single job, I need to do it in two separate jobs. In that case, is there any way to wrap these two jobs and manage the intermediate outputs, etc.?

Solution

You can use Runners.

You will have to define the jobs separately and use another Python script to invoke them.

from NumLines import NumLines
from WordsPerLine import WordsPerLine
import sys

# Output directory of the first job, shared with the second job.
intermediate = None

def firstJob(input_file):
    global intermediate
    mr_job = NumLines(args=[input_file])
    with mr_job.make_runner() as runner:
        runner.run()
        # Remember where the first job wrote its output.
        intermediate = runner.get_output_dir()

def secondJob(input_file):
    # Pass both the first job's output and the original input as input paths.
    mr_job = WordsPerLine(args=[intermediate, input_file])
    with mr_job.make_runner() as runner:
        runner.run()

if __name__ == '__main__':
    firstJob(sys.argv[1])
    secondJob(sys.argv[1])

The script can then be invoked with:

python main_script.py input.txt
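
One thing to watch with this pattern: if the first job writes to a temporary output directory, that directory may be cleaned up when its runner exits, before the second job reads it. A minimal sketch of a variant that avoids the global and uses an explicit intermediate directory is shown here. The names `run_two_jobs` and `work_dir` are hypothetical, and the sketch assumes that mrjob's `--output-dir` option controls where the first job writes and that an explicitly supplied output directory is left in place after the runner exits. The job classes are passed in as parameters, so the helper itself only relies on the `args=` constructor and `make_runner()` used in the answer.

```python
import os
import tempfile


def run_two_jobs(first_job_cls, second_job_cls, input_file, work_dir=None):
    """Run two MRJob-style jobs in sequence, sharing an intermediate directory.

    first_job_cls / second_job_cls are expected to behave like MRJob
    subclasses: constructed with args=[...] and offering make_runner().
    Returns the path of the intermediate output directory.
    """
    # Hypothetical layout: keep intermediate output under one working directory.
    if work_dir is None:
        work_dir = tempfile.mkdtemp(prefix="mrjob_chain_")
    intermediate = os.path.join(work_dir, "step1_output")

    # Assumption: --output-dir tells the first job where to write its output.
    first = first_job_cls(args=[input_file, "--output-dir", intermediate])
    with first.make_runner() as runner:
        runner.run()

    # The second job gets both the first job's output and the original input.
    second = second_job_cls(args=[intermediate, input_file])
    with second.make_runner() as runner:
        runner.run()

    return intermediate
```

From the main script this could be called as `run_two_jobs(NumLines, WordsPerLine, sys.argv[1])`. As in the answer, the second job sees lines from both sources, so its mapper still has to tell the first job's output apart from the original input.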
