Hadoop inverted index without recurrence of file names


Problem description



What I have in the output is:

word    file
-----   ------
wordx   Doc2, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1

What I want is:

word    file
-----   ------
wordx   Doc2, Doc1

public static class LineIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(LongWritable key, Text val,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        location.set(fileName);

        String line = val.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, location);
        }
    }
}

public static class LineIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        boolean first = true;
        StringBuilder toReturn = new StringBuilder();
        while (values.hasNext()) {
            if (!first) {
                toReturn.append(", ");
            }
            first = false;
            toReturn.append(values.next().toString());
        }

        output.collect(key, new Text(toReturn.toString()));
    }
}

For the best performance, where should I skip the recurring file names: in map, in reduce, or in both?

P.S. I am a beginner at writing MR tasks and am still trying to figure out the programming logic behind my question.

Solution

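Duplicates can only be removed reliably in the Reducer, because that is the only place where all the file names emitted for a given word come together. To do so, collect the values in a Set, which does not allow duplicates.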
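As a minimal standalone sketch of that idea (plain Java with no Hadoop classes; the class name `DedupSketch` and the helper `joinDistinct` are hypothetical names chosen for illustration), the following collects values into a Set and joins what is left. It swaps in a `LinkedHashSet` so the first-seen order is kept, which is an assumption about the desired output order and not something the original answer requires.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DedupSketch {

    // Joins the distinct values in first-seen order, separated by ", ".
    static String joinDistinct(Iterator<String> values) {
        Set<String> distinct = new LinkedHashSet<>();
        while (values.hasNext()) {
            distinct.add(values.next()); // repeats are silently ignored
        }
        return String.join(", ", distinct);
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("Doc2", "Doc1", "Doc1", "Doc1");
        System.out.println("wordx " + joinDistinct(files.iterator()));
        // prints "wordx Doc2, Doc1"
    }
}
```

Inside the job, the reduce method applies the same idea to the `Text` values: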

public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    // Text overrides equals() and hashCode(), so a HashSet can detect duplicates
    Set<Text> outputValues = new HashSet<Text>();

    while (values.hasNext()) {
        // copy the value, because Hadoop reuses the same Text instance for each element
        Text value = new Text(values.next());

        // the Set ignores values it already contains, which removes the duplicates
        outputValues.add(value);
    }

    boolean first = true;
    StringBuilder toReturn = new StringBuilder();
    Iterator<Text> outputIter = outputValues.iterator();
    while (outputIter.hasNext()) {
        if (!first) {
            toReturn.append(", ");
        }
        first = false;
        toReturn.append(outputIter.next().toString());
    }

    output.collect(key, new Text(toReturn.toString()));
}

Edit: Added a copy of the value to the Set, as per Chris' comment.
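On the "map, reduce or both" part of the question, one possible optimization is to also skip repeats on the map side, so that fewer pairs are shuffled to the reducers. The sketch below is plain Java that only simulates this; `MapSideDedupSketch` and `mapSplit` are hypothetical names, and it assumes each map task reads lines from a single file. It is not a replacement for the Set in the Reducer: a large file may be split across several map tasks, so the same (word, file) pair can still arrive more than once. The `seen` set also grows with the number of distinct words in the split, which may matter for memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

class MapSideDedupSketch {

    // Hypothetical stand-in for one map task: it sees the lines of one split
    // of one file and returns the (word, file) pairs it would emit.
    static List<String[]> mapSplit(String fileName, List<String> lines) {
        Set<String> seen = new HashSet<>(); // words already emitted by this task
        List<String[]> emitted = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line.toLowerCase());
            while (itr.hasMoreTokens()) {
                String word = itr.nextToken();
                if (seen.add(word)) { // add() returns false for a repeat
                    emitted.add(new String[] {word, fileName});
                }
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("wordx wordy wordx", "WordX wordy");
        for (String[] pair : mapSplit("Doc1", lines)) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```

Whether this is worth doing depends on the data: it helps most when the same words repeat many times within a file.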
