Call mapper when reducer is done


Question


I am executing the job as:

hadoop/bin/./hadoop jar /home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -D mapred.reduce.tasks=2 \
    -file kmeans_mapper.py -mapper kmeans_mapper.py \
    -file kmeans_reducer.py -reducer kmeans_reducer.py \
    -input gutenberg/small_train.csv -output gutenberg/out

When the two reducers are done, I would like to do something with the results, so ideally I would like to call another file (another mapper?) that would receive the reducers' output as its input. How can I do that easily?

I checked this blog, which has an mrjob example, but it doesn't explain the chaining, and I don't see how to do it for my case.

The MapReduce tutorial states:

Users may need to chain MapReduce jobs to accomplish complex tasks which cannot be done via a single MapReduce job. This is fairly easy since the output of the job typically goes to distributed file-system, and the output, in turn, can be used as the input for the next job.
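The chaining the tutorial describes (one job's output becoming the next job's input) can be sketched in plain Python. This is only an illustration of the data flow, not Hadoop API code; the word-count stages and function names are hypothetical stand-ins for the k-means scripts in the question:

```python
from itertools import groupby
from operator import itemgetter

def mapper1(lines):
    # First-stage mapper: emit (word, 1) for every word.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer1(pairs):
    # First-stage reducer: sum counts per word (input sorted by key,
    # which is what the shuffle phase guarantees).
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

def mapper2(pairs):
    # Second-stage mapper: consumes the first reducer's output as input.
    for word, count in pairs:
        yield word.upper(), count

records = ["b a", "b"]
result = list(mapper2(reducer1(mapper1(records))))
print(result)  # [('A', 1), ('B', 2)]
```

In a real cluster each arrow in this pipeline is a separate `hadoop jar` run, with the first job's `-output` directory passed as the second job's `-input`.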

but it doesn't give any example...

Here is some code in Java I could understand, but I am writing Python! :/


This question sheds some light: Chaining multiple mapreduce tasks in Hadoop streaming

Solution

It is possible to do what you're asking with the Java API, as in the example you found.

But you are using the streaming API, which simply reads from standard input and writes to standard output. There is no callback to say when a MapReduce job has completed, other than the hadoop jar command itself finishing; and the fact that it finished doesn't by itself indicate success. That being said, it really isn't possible without some more tooling around the streaming API.
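The usual workaround is exactly that "more tooling": a small driver script that runs the jobs one after another and stops if one fails, relying on the exit status of each `hadoop jar` command (conventionally 0 on success). A minimal sketch; `run_chain` is a hypothetical helper, and the placeholder commands stand in for the real `hadoop jar ... -input ... -output ...` lines:

```python
import subprocess

def run_chain(commands):
    """Run shell commands in order; stop and report on the first failure."""
    for cmd in commands:
        result = subprocess.run(cmd, shell=True)
        if result.returncode != 0:
            return False  # a job failed; don't start the next one
    return True

# In the real chain, each job's -output directory on HDFS would be
# the next job's -input, per the tutorial quote above.
ok = run_chain(["true", "true"])  # placeholders for two hadoop jar commands
print(ok)  # True
```

With this, "call another mapper when the reducers are done" is just the second command in the list.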

If the output were written to the local terminal rather than to HDFS, it might be possible to pipe that output into the input of another streaming job, but unfortunately the inputs and outputs of the streaming jar must be paths on HDFS.
