WordCount example with Count per file

This article walks through a WordCount example that reports the number of occurrences of each word per input file; the question and answer below may be a useful reference for anyone facing the same problem.

Problem Description

I am having an issue getting a breakdown of the total number of occurrences of each word per file. For example, I have four text files (t1, t2, t3, t4). Word w1 appears twice in file t2 and once in t4, for a total of three occurrences. I want to write that same information to the output file. I am getting the total number of words in each file, but I can't get the per-file breakdown described above.

Here is my map class.

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
//line added
import org.apache.hadoop.mapreduce.lib.input.*;

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private String pattern = "^[a-z][a-z0-9]*$";

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        //line added
        InputSplit inputSplit = context.getInputSplit();
        String fileName = ((FileSplit) inputSplit).getPath().getName();

        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            String stringWord = word.toString().toLowerCase();
            if (stringWord.matches(pattern)) {
                //context.write(new Text(stringWord), one);
                context.write(new Text(stringWord), one);
                context.write(new Text(fileName), one);
                //System.out.println(fileName);
            }
        }
    }
}

Solution

In the output of the mapper we can set the text file name as the key and each row of the file as the value. The reducer then gives you, for each file name, the words in that file and their corresponding counts.
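As a rough sketch of that mapper (the class name FileNameMapper is an assumption for illustration; it is not part of the original answer), it reuses the FileSplit lookup from the question's code but emits the file name as the key and the whole line as the value:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical mapper for the approach above: key = file name, value = one line of that file.
public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The input split tells us which file the current line came from.
        InputSplit inputSplit = context.getInputSplit();
        String fileName = ((FileSplit) inputSplit).getPath().getName();

        // Emit (fileName, line); the reducer below then counts the words per file.
        context.write(new Text(fileName), value);
    }
}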

import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // One word -> count map per file name (key), so counts do not leak between files.
        HashMap<String, Integer> input = new HashMap<String, Integer>();

        for (Text val : values) {
            String line = val.toString();         // processing each row (one line of the file)
            String[] wordarray = line.split(" "); // assuming the delimiter is a space
            for (int i = 0; i < wordarray.length; i++) {
                if (input.get(wordarray[i]) == null) {
                    input.put(wordarray[i], 1);
                } else {
                    input.put(wordarray[i], input.get(wordarray[i]) + 1);
                }
            }
        }

        // Emit the file name together with its word -> count map.
        context.write(new Text(key), new Text(input.toString()));
    }
}
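
For completeness, a driver along these lines would wire the two classes together (a sketch only: the original post does not include a driver, and FileNameMapper and WordCountPerFile are assumed names):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountPerFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount per file");
        job.setJarByClass(WordCountPerFile.class);

        job.setMapperClass(FileNameMapper.class); // hypothetical mapper sketched above
        job.setReducerClass(Reduce.class);        // the reducer shown in the answer

        // Map and reduce both emit (Text key, Text value), so one pair of output classes is enough.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With this wiring, each output line is the file name followed by its map of counts, something like t2 {w1=2, w2=1}; reformatting that string is a matter of taste.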

This concludes the article on the WordCount example with a count per file. We hope the answer above helps, and thank you for supporting IT屋!
