Top N values by Hadoop Map Reduce code


Question

I am very new to the Hadoop world and am struggling to achieve one simple task.

Can anybody please tell me how to get the top N values in the word count example using only MapReduce code?

I do not want to use any Hadoop command for this simple task.

Answer

You have two obvious options:

Option 1: Have two MapReduce jobs:

  1. WordCount: counts all the words (pretty much exactly the standard example)
  2. TopN: a MapReduce job that finds the top N of something (here are some examples: source code, blog posts)
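The heart of such a TopN job is a reducer that keeps only the N largest (word, count) pairs it sees. Stripped of the Hadoop boilerplate, that selection logic can be sketched in plain Python with a bounded min-heap; the sample counts below are made up for illustration:

```python
import heapq

def top_n(pairs, n):
    """Stream (word, count) pairs through a min-heap capped at n entries,
    the same idea a TopN reducer uses so it never holds all keys in memory."""
    heap = []  # min-heap of (count, word); the smallest kept count sits at the root
    for word, count in pairs:
        if len(heap) < n:
            heapq.heappush(heap, (count, word))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, word))  # evict the current minimum
    return sorted(heap, reverse=True)  # highest counts first

counts = [("the", 412), ("hadoop", 57), ("map", 91), ("reduce", 88), ("job", 12)]
print(top_n(counts, 3))  # → [(412, 'the'), (91, 'map'), (88, 'reduce')]
```

In a real Hadoop reducer you would accumulate into the heap across reduce() calls and emit it in cleanup(), but the bounded-heap selection is the same.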

Have the output of WordCount write to HDFS. Then, have TopN read that output. This is called job chaining, and there are a number of ways to do it: Oozie, bash scripts, firing both jobs from your driver, etc.

The reason you need two jobs is you are doing two aggregations: one is word count, and the second is topN. Typically in MapReduce each aggregation requires its own MapReduce job.

Option 2: First, have your WordCount job run on the data. Then, use some bash to pull the top N out.

hadoop fs -cat /output/of/wordcount/part* | sort -n -k2 -r | head -n20

sort -n -k2 -r says "sort numerically by column #2, in descending order". head -n20 pulls the top twenty.
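On a tiny made-up sample of WordCount output (one "word<TAB>count" line per word), the same pipeline behaves like this:

```shell
# Fake WordCount output piped through the sort/head stage; take the top 2.
printf 'the\t412\nhadoop\t57\nmap\t91\nreduce\t88\n' \
  | sort -n -k2 -r \
  | head -n2
# prints:
# the	412
# map	91
```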

This is the better option for WordCount, just because WordCount will probably only output on the order of thousands or tens of thousands of lines and you don't need a MapReduce job for that. Remember that just because you have hadoop around doesn't mean you should solve all your problems with Hadoop.

Option 3: One non-obvious version, which is tricky but a mix of both of the above...

Write a WordCount MapReduce job, but in the Reducer do something like in the TopN MapReduce jobs I showed you earlier. Then, have each reducer output only the TopN results from that reducer.

So, if you are doing Top 10, each reducer will output 10 results. Let's say you have 30 reducers, you'll output 300 results.
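Merging the per-reducer lists is safe because each key goes to exactly one reducer: any word in the global top N can be beaten by at most N-1 other words, so it is necessarily in the top N of its own reducer's partition. A small Python simulation of this two-stage selection, with made-up counts and a hypothetical reducer count:

```python
import heapq

# Made-up word counts; 3 stands in for the number of reducers, 2 for N.
counts = {"the": 412, "map": 91, "reduce": 88, "hadoop": 57, "job": 12, "node": 33}
n_reducers, n = 3, 2

# Partition keys disjointly, as Hadoop's partitioner assigns keys to reducers.
partitions = [{} for _ in range(n_reducers)]
for word, c in counts.items():
    partitions[hash(word) % n_reducers][word] = c

# Each "reducer" emits only its local top n ...
local_tops = [heapq.nlargest(n, p.items(), key=lambda kv: kv[1]) for p in partitions]
# ... and a final pass merges the at most n_reducers * n survivors.
merged = heapq.nlargest(n, (kv for top in local_tops for kv in top),
                        key=lambda kv: kv[1])

# The merge recovers the true global top n, regardless of how keys were partitioned.
assert merged == heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])
print(merged)  # → [('the', 412), ('map', 91)]
```

The final merge here corresponds to the bash step below: it only has to look at n_reducers * N lines instead of every distinct word.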

Then, do the same thing as in option #2 with bash:

hadoop fs -cat /output/of/wordcount/part* | sort -n -k2 -r | head -n10

This should be faster because you are only postprocessing a fraction of the results.

This is the fastest way I can think of doing this, but it's probably not worth the effort.

