Hadoop / MapReduce - Optimizing "Top N" Word Count MapReduce Job

Question

I'm working on something similar to the canonical MapReduce example - the word count, but with a twist in that I'm looking to only get the Top N results.

Let's say I have a very large set of text data in HDFS. There are plenty of examples that show how to build a Hadoop MapReduce job that will provide you with a word count for every word in that text. For example, if my corpus is:

"This is a test of test data and a good one to test this"

The result set from the standard MapReduce word count job would be:

test: 3, a: 2, this: 2, is: 1, etc.
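
For reference, a minimal sketch of the canonical word-count job that produces a result set like this (class names such as WordCountSketch, TokenizerMapper and IntSumReducer follow the standard Hadoop example and are only illustrative here):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountSketch {

      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every token
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();           // add up the 1s shuffled to this reducer for the word
          }
          result.set(sum);
          context.write(key, result);   // emit (word, total count)
        }
      }
    }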

But what if I ONLY want to get the Top 3 words that were used in my entire set of data?

I can still run the exact same standard MapReduce word-count job, and then just take the Top 3 results once it is ready and is spitting out the count for EVERY word, but that seems a little inefficient, because a lot of data needs to be moved around during the shuffle phase.

What I'm thinking is that, if this sample is large enough and the data is well randomized and well distributed in HDFS, each Mapper does not need to send ALL of its word counts to the Reducers, but rather only some of the top data. So if one mapper has this:

a: 8234, the: 5422, man: 4352, ... many more words ..., rareword: 1, weirdword: 1, etc.

Then what I'd like to do is only send the Top 100 or so words from each Mapper to the Reducer phase - since there is very little chance that "rareword" will suddenly end up in the Top 3 when all is said and done. This seems like it would save on bandwidth and also on Reducer processing time.

Can this be done in the Combiner phase? Is this sort of optimization prior to the shuffle phase commonly done?
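
To be concrete, by "the Combiner phase" I mean wiring a combiner into the job driver, roughly as in the sketch below (it reuses the illustrative WordCountSketch names from earlier; a combiner pre-sums duplicate (word, 1) pairs per map task, but it still forwards every distinct word):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountSketch.TokenizerMapper.class);
        // the combiner runs on map output before the shuffle, summing duplicate keys
        job.setCombinerClass(WordCountSketch.IntSumReducer.class);
        job.setReducerClass(WordCountSketch.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }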

Answer

This is a very good question, because you have hit the inefficiency of Hadoop's word count example.

The tricks to optimize your problem are the following:

Do a HashMap-based grouping in your local map stage; you can also use a combiner for that. It can look like the code below. I'm using Guava's HashMultiset, which facilitates a nice counting mechanism.

    // imports needed by this mapper (HashMultiset and Multiset.Entry come from Guava)
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    import com.google.common.collect.HashMultiset;
    import com.google.common.collect.Multiset.Entry;

    public static class WordFrequencyMapper extends
        Mapper<LongWritable, Text, Text, LongWritable> {

      // in-mapper aggregation: counts accumulate here instead of emitting (word, 1) per token
      private final HashMultiset<String> wordCountSet = HashMultiset.create();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {

        String[] tokens = value.toString().split("\\s+");
        for (String token : tokens) {
          wordCountSet.add(token);
        }
      }

And you emit the result in your cleanup stage:

    // still inside WordFrequencyMapper; Entry is com.google.common.collect.Multiset.Entry
    @Override
    protected void cleanup(Context context) throws IOException,
        InterruptedException {
      Text key = new Text();
      LongWritable value = new LongWritable();
      for (Entry<String> entry : wordCountSet.entrySet()) {
        key.set(entry.getElement());
        value.set(entry.getCount());
        // one (word, local count) pair per distinct word seen by this map task
        context.write(key, value);
      }
    }

So you have grouped the words within a local block of work, reducing network usage at the cost of a bit of RAM. You can also do the same with a Combiner, but a Combiner sorts to group, so it would be slower (especially for strings!) than using a HashMultiset.

To just get the Top N, you only have to write the Top N from that local HashMultiset to the output collector and aggregate the results in your normal way on the reduce side. This saves you a lot of network bandwidth as well; the only drawback is that you need to sort the word-count tuples in your cleanup method.

A part of the code might look like this:

    // Top-N variant of the cleanup() body: sort locally, then emit only the first N entries
    // (also needs java.util.Arrays, java.util.Comparator and java.util.Set on the import list)
    Set<String> elementSet = wordCountSet.elementSet();
    String[] array = elementSet.toArray(new String[elementSet.size()]);
    Arrays.sort(array, new Comparator<String>() {

      @Override
      public int compare(String o1, String o2) {
        // sort descending by count
        return Long.compare(wordCountSet.count(o2), wordCountSet.count(o1));
      }

    });
    Text key = new Text();
    LongWritable value = new LongWritable();
    // just emit the first N records (guard against having fewer than N distinct words)
    for (int i = 0; i < Math.min(N, array.length); i++) {
      key.set(array[i]);
      value.set(wordCountSet.count(array[i]));
      context.write(key, value);
    }
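
On the reduce side you can keep a plain summing reducer; if you also set job.setNumReduceTasks(1), that single reducer sees every per-mapper Top N and can pick the global Top N in its own cleanup(). A rough, illustrative sketch of that idea (the TopNReducer name and the TreeMap-based selection are assumptions here; ties on the count overwrite each other in this simplified version):

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopNReducer extends
        Reducer<Text, LongWritable, Text, LongWritable> {

      private static final int N = 3;
      // count -> word, ordered ascending so the smallest entry is cheap to evict
      private final TreeMap<Long, String> topN = new TreeMap<>();

      @Override
      protected void reduce(Text key, Iterable<LongWritable> values, Context context)
          throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable val : values) {
          sum += val.get();               // sum the partial counts emitted by each mapper
        }
        topN.put(sum, key.toString());
        if (topN.size() > N) {
          topN.remove(topN.firstKey());   // drop the current smallest count
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit the global Top N, largest count first
        for (Map.Entry<Long, String> entry : topN.descendingMap().entrySet()) {
          context.write(new Text(entry.getValue()), new LongWritable(entry.getKey()));
        }
      }
    }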

Hope you get the gist: do as much of the word counting locally as possible, and then just aggregate the Top N of the per-mapper Top N's ;)
