How to re-arrange wordcount hadoop output result and sort them by value


Question


I use the code below to get output like (Key, Value):

Apple 12
Bee 345 
Cat 123

What I want is the output sorted by value in descending order (345 first), with the value placed before the key, i.e. (Value, Key):

345 Bee
123 Cat
12 Apple

I found there is something called "secondary sort", not going to lie, but I'm so lost - I tried to change context.write(key, result); but failed miserably. I'm new to Hadoop and not sure how to start tackling this problem. Any recommendation would be appreciated. Which function do I need to change, or which class do I need to modify?

Here are my classes:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer 
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, 
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job,
      new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Solution

You have done the word count part correctly.

You will need a second, map-only job to meet the second requirement: sorting in descending order and swapping key and value.

  1. Use a DecreasingComparator as the sort comparator
  2. Use InverseMapper to swap keys and values
  3. Use the identity reducer, i.e. Reducer.class; with the identity reducer no aggregation happens (each value is output individually for its key)
  4. Set the number of reduce tasks to 1, or use TotalOrderPartitioner
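The steps above can be sketched as a second driver class. This is a minimal, untested sketch: it assumes the first job is changed to write its output as a SequenceFile (job.setOutputFormatClass(SequenceFileOutputFormat.class)) so the second job receives (Text, IntWritable) pairs directly, and the class name SortByValue and the DescendingIntComparator helper are made up for illustration (IntWritable has no stock decreasing comparator, so one is derived from IntWritable.Comparator):

```java
package org.apache.hadoop.examples;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortByValue {

  // Hypothetical helper: sorts IntWritable keys in descending order
  // by negating the byte-level comparison of IntWritable.Comparator.
  public static class DescendingIntComparator extends IntWritable.Comparator {
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      return -super.compare(b1, s1, l1, b2, s2, l2);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "sort by value");
    job.setJarByClass(SortByValue.class);

    // Read the (word, count) SequenceFile written by the word-count job
    job.setInputFormatClass(SequenceFileInputFormat.class);

    // InverseMapper turns each (word, count) into (count, word)
    job.setMapperClass(InverseMapper.class);

    // Identity reducer: every (count, word) pair is written out unchanged
    job.setReducerClass(Reducer.class);

    // A single reducer yields one globally sorted output file
    job.setNumReduceTasks(1);

    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);

    // Shuffle sort order: counts descending, so the largest count comes first
    job.setSortComparatorClass(DescendingIntComparator.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

You would run this job with the first job's output directory as its input and a fresh directory as its output; chaining the two jobs in one driver (first job, then this one) also works.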

