How to "sort" the 30 most frequent words in descending order?


Problem Description


My mapper (Hadoop 1.2.1) creates key-value pairs of tokens, which it reads from a simple text file. No rocket science. The reducer finally "bundles" (in Hadoop, do you call that grouping, like in SQL?) the same keys and also sums the values of 1. This is the default Hadoop tutorial.


However, when these values are available in my reducer, I want to sort all of them in descending order and display only the top 30 tokens (strings, words).


It seems like some concepts are not clear to me.


  • First, the reduce method is invoked for every key-value pair, right? Thus, I don't see a place to buffer something like a HashMap, which could hold the top results (most frequent tokens).
  • I was thinking that if I had such a variable, I could easily compare and insert every key whose value is within the top 30. What is the appropriate approach to handle this frequency-ranking task?

public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {

        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        // CURRENTLY I SIMPLY OUTPUT THE KEY AND THE SUM.
        // IN THIS PLACE, HOW COULD YOU STORE E.G. A HASHMAP THAT
        // COULD STORE THE TOP 30?
        output.collect(key, new IntWritable(sum));
        LOG.info("REDUCE: added to output:: key: " + key.toString());
    }
}

Recommended Answer


First, the reduce method is invoked for every key-value pair, right? Thus, I don't see a place to buffer something like a HashMap, which could hold the top results (most frequent tokens).


A bit of a nuance: the reduce method is run once per key, not once per key-value pair. Every value with that key is presented in the Iterator. If you want to store a HashMap, you can set it up in the setup function (or make it a private field), interact with it in the reduce function, and then do whatever you like with it in the cleanup function. So it is definitely possible to maintain state across calls to reduce.
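To make the lifecycle concrete, here is a minimal plain-Java sketch of the setup/reduce/cleanup pattern described above, without any Hadoop dependencies (the class name, the field, and the method signatures are illustrative; a real Hadoop reducer would extend org.apache.hadoop.mapreduce.Reducer and override these methods):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the setup/reduce/cleanup lifecycle.
// The map lives in a field, so it survives across every reduce()
// call and can be emitted once, at the end, in cleanup().
public class StatefulReducer {
    private final Map<String, Integer> totals = new HashMap<>();

    // Called once before the first reduce() call.
    public void setup() {
        totals.clear();
    }

    // Called once per key, with all of that key's values.
    public void reduce(String key, int[] values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        totals.put(key, sum);
    }

    // Called once, after the last reduce() call.
    public Map<String, Integer> cleanup() {
        return totals;
    }
}
```

In an actual Hadoop job, cleanup would write the buffered results to the Context instead of returning them.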


However, I think you might be able to solve your problem in a slightly more clever way. I've written about top-ten lists a number of times, simply because I find them interesting and they are very useful tools. I hope it's obvious how top-30 relates to top-10.


  • Here is an example of a top-ten list generator I wrote a while back that can be adapted to your problem. You may be able to change how you are solving your problem a bit to fit this pattern. In my code I use a TreeMap instead of a HashMap, because the TreeMap keeps things in sorted order. Once you get to 31 items, pop off the one with the lowest frequency.
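The TreeMap eviction trick described above can be sketched in plain Java as follows (no Hadoop dependencies; the class and method names are illustrative, and a real job would apply this inside the reducer and emit the survivors in cleanup):

```java
import java.util.Map;
import java.util.TreeMap;

public class TopN {
    // Keep only the n highest-count tokens. The TreeMap is keyed by
    // count, and TreeMap iterates its keys in ascending order, so
    // firstKey() is always the smallest count currently retained.
    static Map<Integer, String> topN(Map<String, Integer> counts, int n) {
        TreeMap<Integer, String> top = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            top.put(e.getValue(), e.getKey());
            if (top.size() > n) {
                top.remove(top.firstKey()); // evict the lowest frequency
            }
        }
        return top;
    }
}
```

One caveat: because the count is the map key, two tokens with the same frequency overwrite each other here; a production version would use a TreeMap<Integer, List<String>> or a composite key to keep ties.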


I also discuss the top-ten pattern in the book MapReduce Design Patterns (sorry for the shameless plug).

