Top N values by Hadoop MapReduce code


Problem description





I am very new in hadoop world and struggling to achieve one simple task.

Can anybody please tell me how to get top n values for word count example by using only Map reduce code technique?

I do not want to use any hadoop command for this simple task.

Solution

You have two obvious options (plus a third, non-obvious mix of the two):


Option 1: Have two MapReduce jobs:

  1. WordCount: counts all the words (pretty much the example exactly)
  2. TopN: A MapReduce job that finds the top N of something (here are some examples: source code, blog post)

Have the output of WordCount write to HDFS. Then, have TopN read that output. This is called job chaining and there are a number of ways to solve this problem: oozie, bash scripts, firing two jobs from your driver, etc.

The reason you need two jobs is you are doing two aggregations: one is word count, and the second is topN. Typically in MapReduce each aggregation requires its own MapReduce job.
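The two-stage aggregation can be sketched outside Hadoop, e.g. in plain Python (illustrative only; `wordcount` and `top_n` are made-up names standing in for the two jobs):

```python
from collections import Counter
import heapq

def wordcount(lines):
    """Stage 1 (the WordCount job): aggregate word -> count."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def top_n(counts, n):
    """Stage 2 (the TopN job): a second aggregation over stage 1's output."""
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

lines = ["a b a", "c a b"]
print(top_n(wordcount(lines), 2))  # [('a', 3), ('b', 2)]
```

In real Hadoop, stage 2 would read stage 1's HDFS output directory as its input; here the `Counter` is simply passed in memory.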


Option 2: First, have your WordCount job run on the data. Then, use some bash to pull the top N out.

hadoop fs -cat /output/of/wordcount/part* | sort -n -k2 -r | head -n20

sort -n -k2 -r says "sort numerically by column #2, in descending order". head -n20 pulls the top twenty.

This is the better option for WordCount, just because WordCount will probably only output on the order of thousands or tens of thousands of lines and you don't need a MapReduce job for that. Remember that just because you have hadoop around doesn't mean you should solve all your problems with Hadoop.


Option 3: A non-obvious version, which is tricky but a mix of both of the above...

Write a WordCount MapReduce job, but in the Reducer do something like in the TopN MapReduce jobs I showed you earlier. Then, have each reducer output only the TopN results from that reducer.
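The per-reducer logic might look like the following sketch (plain Python, not actual Hadoop code; `reducer_top_n` is a hypothetical name, and in a real Java reducer the final emit would happen in `cleanup()`):

```python
import heapq

def reducer_top_n(word_counts, n):
    """Simulates one reducer: keep only the n largest (count, word)
    pairs in a min-heap instead of emitting every word."""
    heap = []  # min-heap of (count, word); smallest kept pair is on top
    for word, count in word_counts:
        if len(heap) < n:
            heapq.heappush(heap, (count, word))
        elif count > heap[0][0]:
            # new pair beats the smallest kept pair; swap it in
            heapq.heapreplace(heap, (count, word))
    # emit in descending order, as the reducer's cleanup step would
    return sorted(heap, reverse=True)

print(reducer_top_n([("x", 5), ("y", 1), ("z", 9), ("w", 4)], 2))
# [(9, 'z'), (5, 'x')]
```

The bounded heap is the key point: each reducer's memory and output stay at size n regardless of how many distinct words it sees.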

So, if you are doing Top 10, each reducer will output 10 results. With, say, 30 reducers, you'll output 300 results in total.

Then, do the same thing as in option #2 with bash:

hadoop fs -cat /output/of/wordcount/part* | sort -n -k2 -r | head -n10

This should be faster because you are only postprocessing a fraction of the results.

This is the fastest way I can think of doing this, but it's probably not worth the effort.

