How to create a custom output format in Hadoop


Question

I am trying to create a variation of the Hadoop word count program that reads multiple files in a directory and outputs the frequency of each word. The thing is, I want it to output each word followed by the name of the file it came from and its frequency in that file. For example:

word1
( file1, 10)
( file2, 3)
( file3, 20)

So for word1 (say, the word "and"), it finds it 10 times in file1, 3 times in file2, etc. Right now it outputs only a single key-value pair:

StringTokenizer itr = new StringTokenizer(chapter);
while (itr.hasMoreTokens()) {
  word.set(itr.nextToken());
  context.write(word, one);
}

I can get the file name with:

String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

But I do not understand how to format the output the way I want. I've been looking into OutputCollector, but I am unsure of exactly how to use it.

This is my mapper and reducer:

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, Text> {

  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {

    // Lowercase the line and replace every non-letter with a space
    String chapter = value.toString();
    chapter = chapter.toLowerCase();
    chapter = chapter.replaceAll("[^a-z]", " ");

    // Name of the file this input split came from
    String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

    // Emit one (word, fileName) pair per token
    StringTokenizer itr = new StringTokenizer(chapter);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, new Text(fileName));
    }
  }
}


public static class IntSumReducer
    extends Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {

    // Count how many times this word appeared in each file
    Map<String, Integer> files = new HashMap<String, Integer>();
    for (Text val : values) {
      if (files.containsKey(val.toString())) {
        files.put(val.toString(), files.get(val.toString()) + 1);
      } else {
        files.put(val.toString(), 1);
      }
    }

    // Build one "<file, count>" entry per line
    String outputString = "";
    for (String file : files.keySet()) {
      outputString = outputString + "\n<" + file + ", " + files.get(file) + ">";
    }

    context.write(key, new Text(outputString));
  }
}

For example, this is what it outputs for the word "a":

a   
(
(chap02, 53), 1)
(
(chap18, 50), 1)

I am unsure why it is turning each key-value pair into a key with a value of 1 for each entry.

Recommended answer

I don't think you need a custom output format at all for this. As long as you pass the file name along to the reducer, you should be able to do this simply by changing the String that you write out through the regular TextOutputFormat. An explanation is below.

In the mapper, get the file name and write it out as the Text value for each word, as below:

String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
context.write(word, new Text(fileName)); // key is the current word, value is the file name
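
To make the shape of the mapper output concrete, here is a minimal sketch in plain Java with no Hadoop dependency. `MapSketch` and `mapLine` are hypothetical names invented for illustration; the sketch only mimics what the map step emits for a single line of input, assuming the same normalization as the mapper in the question.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical helper, not part of Hadoop: mimics the map step for one input line.
final class MapSketch {

    // Returns one (word, fileName) pair per token, in input order.
    static List<Map.Entry<String, String>> mapLine(String line, String fileName) {
        // Same normalization as the question's mapper:
        // lowercase everything, then turn every non-letter into a space.
        String cleaned = line.toLowerCase().replaceAll("[^a-z]", " ");

        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(cleaned);
        while (itr.hasMoreTokens()) {
            // In the real mapper this line would be:
            // context.write(word, new Text(fileName));
            pairs.add(new SimpleEntry<>(itr.nextToken(), fileName));
        }
        return pairs;
    }
}
```

Every occurrence of a word produces its own pair, so a word that appears 10 times in file1 reaches the reducer as 10 values that all say "file1". Counting those repeated values is what gives the per-file frequency.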

Then in the reducer, do something like the following:

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // Count how many times this word was seen in each file
    Map<String, Integer> files = new HashMap<String, Integer>();
    for (Text val : values) {
        String file = val.toString();
        Integer count = files.get(file);
        files.put(file, count == null ? 1 : count + 1);
    }

    // Start empty: TextOutputFormat already writes the key (the word) before the value
    StringBuilder outputString = new StringBuilder();
    for (Map.Entry<String, Integer> entry : files.entrySet()) {
        outputString.append("\n( ").append(entry.getKey())
                    .append(", ").append(entry.getValue()).append(")");
    }

    context.write(key, new Text(outputString.toString()));
}

This reducer prefixes each (file, count) entry with "\n", so every entry lands on its own line beneath the word, which forces the display format you asked for.
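
The counting and formatting done in the reducer can be tried outside Hadoop. The following is a minimal sketch, not the answer's exact code: `ReduceSketch` and `formatCounts` are hypothetical names, and it swaps the HashMap for a TreeMap purely so that files come out in a stable alphabetical order (a HashMap gives no ordering guarantee).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: mimics the reduce step for one word.
final class ReduceSketch {

    // Takes every file name received for a single word and returns the
    // value string the reducer would write for that word.
    static String formatCounts(Iterable<String> fileNames) {
        // Assumption: TreeMap instead of the answer's HashMap, for sorted output.
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : fileNames) {
            Integer seen = counts.get(name);
            counts.put(name, seen == null ? 1 : seen + 1);
        }

        // Each entry starts with "\n" so it appears on its own line.
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            out.append("\n( ").append(entry.getKey())
               .append(", ").append(entry.getValue()).append(")");
        }
        return out.toString();
    }
}
```

With the default TextOutputFormat settings, each record is written as the key, a tab, and then the value, so the word itself ends up on the first line and the entries built here follow on the lines below it.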

This seems much simpler than writing your own OutputFormat.
