Java Hadoop: How can I create mappers that take files as input and give as output the number of lines in each file?


Problem Description



I'm new to Hadoop, and so far I've only managed to run the wordCount example: http://hadoop.apache.org/common/docs/r0.18.2/mapred_tutorial.html

Suppose we have a folder which has 3 files. I want to have one mapper for each file, and this mapper will just count the number of lines and return it to the reducer.

The reducer will then take as an input the number of lines from each mapper, and give as an output the total number of lines that exist in all 3 files.

So if we have the following 3 files

input1.txt
input2.txt
input3.txt

and the mappers return:

mapper1 -> [input1.txt, 3]
mapper2 -> [input2.txt, 4]
mapper3 -> [input3.txt, 9]

the reducer will give an output of

3+4+9 = 16 

I have done this in a simple Java application, so now I would like to do it in Hadoop. I have just one computer and would like to try running it in a pseudo-distributed environment.

How can I achieve this? What are the proper steps to take?

Should my code look like the Apache example, with two static classes, one for the mapper and one for the reducer? Or should I have three classes, one for each mapper?

If you can, please guide me through this. I have no idea how to do it, and I believe that if I manage to write some code that does this, then I will be able to write more complex applications in the future.

Thanks!

Solution

In addition to sa125's answer, you can hugely improve performance by not emitting a record for every input line, but rather just accumulating a counter in the mapper and then, in the mapper's cleanup method, emitting the filename and the count value:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Lines seen by this mapper, i.e. in this mapper's input split.
    protected long lines = 0;

    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        // Runs once, after the last call to map(): emit a single
        // (filename, line count) record for the whole split.
        FileSplit split = (FileSplit) context.getInputSplit();
        String filename = split.getPath().toString();

        context.write(new Text(filename), new LongWritable(lines));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once per input line; just count, don't emit anything yet.
        lines++;
    }
}
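
The answer above only shows the mapper. To round out the job, here is a minimal sketch of a matching summing reducer and driver, assuming the same org.apache.hadoop.mapreduce API; this part is not from the original answer, and the names LineCount, LineReducer, and the jar name below are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineCount {

    // Sums the per-file counts emitted by LineMapper into one grand total.
    public static class LineReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {

        private long total = 0;

        @Override
        protected void reduce(Text filename, Iterable<LongWritable> counts,
                Context context) throws IOException, InterruptedException {
            // One call per file (each filename is a distinct key); add its count.
            for (LongWritable count : counts) {
                total += count.get();
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // After all files have been summed, emit the total once.
            context.write(new Text("total"), new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "line count");
        job.setJarByClass(LineCount.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setNumReduceTasks(1);  // one reducer, so one grand total
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the driver sets the number of reduce tasks to 1, the single reducer instance sees every (filename, count) pair, accumulates the grand total in a field, and emits it once in its own cleanup method: 3 + 4 + 9 = 16 for the three example files. Package the classes into a jar and submit it with something like: hadoop jar linecount.jar LineCount input output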

