Chaining multiple mapreduce tasks in Hadoop streaming

Question

I have a scenario with two MapReduce jobs. I am more comfortable with Python, so I plan to write the MapReduce scripts in it and run them with Hadoop streaming. Is there a convenient way to chain the two jobs in the following form when Hadoop streaming is used?

Map1 -> Reduce1 -> Map2 -> Reduce2

I've heard of many ways to accomplish this in Java, but I need something for Hadoop streaming.

Recommended answer

Here is a great blog post on how to use Cascading and Streaming: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

The value here is that you can mix Java (Cascading query flows) with your custom streaming operations in the same app. I find this much less brittle than other methods.

Note that the Cascade object in Cascading allows you to chain multiple Flows (following the above blog post, your Streaming job would become a MapReduceFlow).

Disclaimer: I'm the author of Cascading.
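
For comparison, the most basic way to get Map1 -> Reduce1 -> Map2 -> Reduce2 without Cascading is a small Python driver that launches the streaming jobs one after another, feeding each job's output directory to the next job as input. This is not the approach the answer recommends, only a minimal sketch of the plain alternative. The jar path, the script names, and the `chain_tmp` directory are hypothetical placeholders, and the scripts are assumed to be executable files with a shebang line in the current directory.

```python
import subprocess

# Hypothetical path: the streaming jar's location depends on the Hadoop install.
STREAMING_JAR = "/path/to/hadoop-streaming.jar"


def streaming_cmd(mapper, reducer, input_path, output_path, jar=STREAMING_JAR):
    """Build the argument list for a single Hadoop streaming job."""
    return [
        "hadoop", "jar", jar,
        # Generic option (must precede the streaming options): ship the scripts to the tasks.
        "-files", f"{mapper},{reducer}",
        "-input", input_path,
        "-output", output_path,  # this directory must not exist yet
        "-mapper", mapper,
        "-reducer", reducer,
    ]


def chain_cmds(stages, input_path, final_output, tmp_prefix="chain_tmp"):
    """Build one command per (mapper, reducer) stage, wiring output to next input."""
    cmds = []
    current_input = input_path
    for i, (mapper, reducer) in enumerate(stages):
        is_last = i == len(stages) - 1
        # Intermediate stages write to a temporary directory (name is an assumption).
        out = final_output if is_last else f"{tmp_prefix}/stage{i + 1}"
        cmds.append(streaming_cmd(mapper, reducer, current_input, out))
        current_input = out
    return cmds


def run_chain(cmds, runner=subprocess.run):
    """Run the jobs in order; a failing job stops the chain."""
    for cmd in cmds:
        # check=True raises CalledProcessError on a nonzero exit status.
        runner(cmd, check=True)


# Example (hypothetical script and path names):
# run_chain(chain_cmds(
#     [("mapper1.py", "reducer1.py"), ("mapper2.py", "reducer2.py")],
#     input_path="input", final_output="output"))
```

A driver like this has no retry logic and leaves the intermediate directory behind for manual cleanup, which is the kind of brittleness a workflow layer is meant to remove.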
