Chaining multiple MapReduce jobs in Hadoop

Problem Description

In many real-life situations where you apply MapReduce, the final algorithms end up being several MapReduce steps.

i.e. Map1, Reduce1, Map2, Reduce2, and so on.

So you have the output from the last reduce that is needed as the input for the next map.

The intermediate data is something you (in general) do not want to keep once the pipeline has been successfully completed. Also because this intermediate data is in general some data structure (like a 'map' or a 'set') you don't want to put too much effort in writing and reading these key-value pairs.

What is the recommended way of doing that in Hadoop?

Is there a (simple) example that shows how to handle this intermediate data in the correct way, including the cleanup afterward?

Recommended Answer

I think this tutorial on Yahoo's developer network will help you with this: Chaining Jobs

You use JobClient.runJob(). The output path of the data from the first job becomes the input path to your second job. These need to be passed in as arguments to your jobs, with appropriate code to parse them and set up the parameters for each job.
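
For illustration only, here is a minimal sketch of that pattern with the older org.apache.hadoop.mapred API. The paths and job names are placeholders, and IdentityMapper/IdentityReducer stand in for your own Map1/Reduce1 and Map2/Reduce2 classes:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);        // original input
        Path intermediate = new Path(args[1]); // output of job 1, input of job 2
        Path output = new Path(args[2]);       // final output

        // Job 1 (Map1 + Reduce1): writes its result to the intermediate path.
        JobConf job1 = new JobConf(ChainDriver.class);
        job1.setJobName("step-1");
        job1.setMapperClass(IdentityMapper.class);   // substitute your Map1
        job1.setReducerClass(IdentityReducer.class); // substitute your Reduce1
        job1.setOutputKeyClass(LongWritable.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        JobClient.runJob(job1); // blocks until job 1 has finished

        // Job 2 (Map2 + Reduce2): reads the intermediate path written by job 1.
        JobConf job2 = new JobConf(ChainDriver.class);
        job2.setJobName("step-2");
        job2.setMapperClass(IdentityMapper.class);   // substitute your Map2
        job2.setReducerClass(IdentityReducer.class); // substitute your Reduce2
        job2.setOutputKeyClass(LongWritable.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        JobClient.runJob(job2);
    }
}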

I think the above method may, however, be how the now-older mapred API did it, but it should still work. There is a similar method in the new mapreduce API, but I'm not sure what it is.
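
For comparison, a rough sketch of the same chaining with the newer org.apache.hadoop.mapreduce API (assuming Hadoop 2.x or later). Here Job.waitForCompletion() plays the role of JobClient.runJob(), and the default identity Mapper/Reducer stand in for your own steps:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriverNewApi {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // handed from job 1 to job 2
        Path output = new Path(args[2]);

        // Job 1: leaving the mapper/reducer classes unset keeps the identity defaults.
        Job job1 = Job.getInstance(conf, "step-1");
        job1.setJarByClass(ChainDriverNewApi.class);
        job1.setOutputKeyClass(LongWritable.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) {
            System.exit(1); // stop the pipeline if the first step fails
        }

        // Job 2: consumes the intermediate output of job 1.
        Job job2 = Job.getInstance(conf, "step-2");
        job2.setJarByClass(ChainDriverNewApi.class);
        job2.setOutputKeyClass(LongWritable.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}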

As for removing intermediate data after a job has finished, you can do this in your code. The way I've done it before is using something like:

FileSystem.delete(Path f, boolean recursive);

Where the path is the location on HDFS of the data. You need to make sure that you only delete this data once no other job requires it.
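
A small sketch of that cleanup, assuming the whole pipeline has completed and the intermediate directory (a hypothetical path below) is no longer needed by any job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanupIntermediate {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical intermediate directory shared between the chained jobs.
        Path intermediate = new Path("/tmp/pipeline/intermediate");

        // Delete it only once every downstream job has finished with it.
        if (fs.exists(intermediate)) {
            fs.delete(intermediate, true); // true = delete the directory recursively
        }
    }
}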
