Chaining multiple mapreduce tasks in Hadoop streaming
Question
I am in a scenario where I have two MapReduce jobs. I am more comfortable with Python and plan to use it to write the MapReduce scripts, running them with Hadoop Streaming. Is there a convenient way to chain both jobs in the following form when Hadoop Streaming is used?
Map1 -> Reduce1 -> Map2 -> Reduce2
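In plain Hadoop Streaming, a chain like this is usually expressed as two back-to-back job submissions, with the first job's output directory feeding the second job's input. A minimal sketch follows; all paths, script names (`mapper1.py`, `reducer1.py`, etc.), and the streaming-jar location are hypothetical and must be adapted to your cluster:

```shell
# Stage 1: Map1 -> Reduce1, writing to an intermediate HDFS directory
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input  /user/me/input \
    -output /user/me/intermediate \
    -mapper  mapper1.py \
    -reducer reducer1.py \
    -file mapper1.py -file reducer1.py

# Stage 2: Map2 -> Reduce2, reading the intermediate directory
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input  /user/me/intermediate \
    -output /user/me/final \
    -mapper  mapper2.py \
    -reducer reducer2.py \
    -file mapper2.py -file reducer2.py
```

The intermediate directory is ordinary job output, so it lands on HDFS and must not exist before stage 1 runs; a wrapper script typically deletes it afterwards.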
I've heard of many ways to accomplish this in Java, but I need something for Hadoop Streaming.
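Before wiring the stages into Hadoop, it can help to check the chaining logic locally. The sketch below simulates the Map1 -> Reduce1 -> Map2 -> Reduce2 pipeline in plain Python (the sort-and-group step stands in for the shuffle); the word-count-then-invert stages are purely illustrative, not part of the original question:

```python
import itertools

def map1(line):
    # Map1: emit (word, 1) for each word in a line of text
    for word in line.split():
        yield word.lower(), 1

def reduce1(word, counts):
    # Reduce1: sum the counts for each word
    yield word, sum(counts)

def map2(record):
    # Map2: re-key each (word, count) pair by count,
    # so Reduce2 groups words sharing a frequency
    word, count = record
    yield count, word

def reduce2(count, words):
    # Reduce2: collect all words with the same frequency
    yield count, sorted(words)

def run_stage(mapper, reducer, records):
    """Run one map/sort/reduce stage over an iterable of records."""
    mapped = sorted(kv for rec in records for kv in mapper(rec))
    results = []
    for key, group in itertools.groupby(mapped, key=lambda kv: kv[0]):
        results.extend(reducer(key, [v for _, v in group]))
    return results

stage1 = run_stage(map1, reduce1, ["a b a", "b c"])   # word counts
stage2 = run_stage(map2, reduce2, stage1)             # words per count
```

Each `run_stage` call corresponds to one streaming job; on a real cluster the two stages would be separate `hadoop jar` submissions connected by an intermediate HDFS directory.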
Answer
Here is a great blog post on how to use Cascading with Streaming: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/
The value here is that you can mix Java (Cascading query flows) with your custom streaming operations in the same app. I find this much less brittle than other methods.
Note that the Cascade object in Cascading allows you to chain multiple Flows (per the above blog post, your Streaming job would become a MapReduceFlow).
Disclaimer: I'm the author of Cascading.