Wordcount common words of files


Question

I have managed to run the Hadoop wordcount example in non-distributed mode; I get the output in a file named "part-00000", and I can see that it lists all the words of all the input files combined.

After tracing the wordcount code, I can see that it takes lines and splits them into words based on spaces.

I am trying to think of a way to list only the words that occur in multiple files, along with their occurrences. Can this be achieved in Map/Reduce? -Added- Are these changes appropriate?

    //changes in the parameters here
    public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

      // These are the original lines; I am not using them but left them here...
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      //My changes are here too
      private Text outvalue = new Text();
      // Note: 'reporter' is referenced at class scope here; see the answer below
      FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
      private String filename = fileSplit.getPath().getName();

      public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());

          // And here
          outvalue.set(filename);
          output.collect(word, outvalue);
        }
      }
    }

Recommended Answer

You could amend the mapper to output the word as the key and a Text value holding the filename the word came from. Then, in your reducer, you just need to dedup the filenames and output those entries where the word appears in more than a single file.
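
A minimal sketch of that reducer (old mapred API, to match the code in the question; DedupReducer is an illustrative name, assumed to live as an inner class of the job class alongside these imports):

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public static class DedupReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, IntWritable> {

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Collect the distinct filenames this word was seen in
    Set<String> files = new HashSet<String>();
    while (values.hasNext()) {
      files.add(values.next().toString());
    }
    // Only emit words that appear in more than one file,
    // with the number of files as the value
    if (files.size() > 1) {
      output.collect(key, new IntWritable(files.size()));
    }
  }
}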

How to get the filename of the file being processed depends on whether you're using the new API or not (mapred vs. mapreduce package names). With the new API, you can extract the mapper's input split from the Context object using the getInputSplit method (then probably cast the InputSplit to a FileSplit, assuming you are using TextInputFormat). For the old API, I've never tried it, but apparently you can use a configuration property called map.input.file.
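
A hedged sketch of both approaches (the class name and method bodies are illustrative; the new-API FileSplit lives in org.apache.hadoop.mapreduce.lib.input):

// New API (org.apache.hadoop.mapreduce): grab the filename once in setup(),
// casting the input split to a FileSplit (valid with TextInputFormat).
public static class NewApiMapper
    extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {

  private String filename;

  @Override
  protected void setup(Context context) {
    filename = ((org.apache.hadoop.mapreduce.lib.input.FileSplit)
        context.getInputSplit()).getPath().getName();
  }

  // map(...) would then emit (word, filename) pairs as in the question
}

// Old API (org.apache.hadoop.mapred): inside a mapper extending MapReduceBase,
// read the "map.input.file" property, which holds the full path of the
// file being processed.
@Override
public void configure(JobConf job) {
  filename = new Path(job.get("map.input.file")).getName();
}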

This would also be a good place to introduce a Combiner, to dedup multiple occurrences of a word coming from the same mapper.
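
For example, a combiner could collapse duplicate (word, filename) pairs before they hit the network; a sketch under the same old-API assumptions (DedupCombiner is an illustrative name):

public static class DedupCombiner extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Within a single mapper every value for a word is the same
    // filename, so forwarding each distinct filename once suffices
    Set<String> seen = new HashSet<String>();
    while (values.hasNext()) {
      String file = values.next().toString();
      if (seen.add(file)) {
        output.collect(key, new Text(file));
      }
    }
  }
}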

Update

So, in response to your problem: you're trying to use an instance variable called reporter, which doesn't exist in the class scope of the mapper. Amend as follows:

public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
  // These are the original lines; I am not using them but left them here...
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  //My changes are here too
  private Text outvalue = new Text();
  private String filename = null;

  public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Resolve the filename lazily, once per task, now that the
    // Reporter is in scope inside the map method
    if (filename == null) {
      filename = ((FileSplit) reporter.getInputSplit()).getPath().getName();
    }

    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());

      // And here
      outvalue.set(filename);
      output.collect(word, outvalue);
    }
  }
}

(really not sure why SO isn't respecting the formatting in the above...)
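
For completeness, a hedged sketch of how the pieces might be wired together with the old API (WordsInMultipleFiles and the input/output paths are illustrative; KeyValueTextInputFormat is assumed because the mapper above declares a Text input key, whereas the default TextInputFormat supplies LongWritable keys):

JobConf conf = new JobConf(WordsInMultipleFiles.class);  // illustrative driver class
conf.setJobName("words-in-multiple-files");

// Text keys in the mapper signature imply a key/value input format;
// the default TextInputFormat would hand the mapper LongWritable keys
conf.setInputFormat(KeyValueTextInputFormat.class);

conf.setMapperClass(Map.class);
conf.setCombinerClass(DedupCombiner.class);  // optional, from the sketch above
conf.setReducerClass(DedupReducer.class);

conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(Text.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

FileInputFormat.setInputPaths(conf, new Path("input"));    // illustrative paths
FileOutputFormat.setOutputPath(conf, new Path("output"));

JobClient.runJob(conf);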

