MapReduce find word length frequency


Question

I am new to MapReduce and I wanted to ask if someone can give me an idea of how to compute word length frequency using MapReduce. I already have the code for word count, but I want to use word length instead. This is what I have so far.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // Emit (word, 1) for every token in the line
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }
}

Thanks...

Answer

For word length frequency, tokenizer.nextToken() should not be emitted as the key; the length of that string is what should be counted. So your code works with just the following change:

word.set(String.valueOf(tokenizer.nextToken().length()));

Now, if you look more closely, you will see that the Mapper output key should no longer be Text, even though it still works. It is better to use an IntWritable key instead:

public static class Map extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private IntWritable wordLength = new IntWritable();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // Emit (word length, 1) instead of (word, 1)
            wordLength.set(tokenizer.nextToken().length());
            context.write(wordLength, one);
        }
    }
}
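
The reduce side then just has to match the new key type. Here is a minimal summing-reducer sketch (not part of the original answer; the names Reduce and total are only illustrative) that turns the (length, 1) pairs into (length, frequency) pairs:

// Requires: import org.apache.hadoop.mapreduce.Reducer;
public static class Reduce
        extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    private IntWritable total = new IntWritable();

    public void reduce(IntWritable length, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted by the mapper for this word length
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(length, total);  // e.g. key 5 -> number of 5-letter words
    }
}

If you configure the job yourself, remember to declare the new key type as well, e.g. job.setMapOutputKeyClass(IntWritable.class) and job.setOutputKeyClass(IntWritable.class), since the original word count job is presumably configured with Text.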

Although most MapReduce examples use StringTokenizer, it is cleaner and generally advisable to use the String.split method instead, so make the change accordingly.
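
As a rough sketch of that change, the map method of the class above could look like this (splitting on runs of whitespace with the regex \s+ is my own choice here, not something prescribed by the original answer):

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Split the line on runs of whitespace instead of using StringTokenizer;
    // wordLength and one are the fields declared in the Map class above.
    for (String token : value.toString().trim().split("\\s+")) {
        if (!token.isEmpty()) {              // skip empty tokens from blank lines
            wordLength.set(token.length());
            context.write(wordLength, one);
        }
    }
}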
