Hadoop - How to Collect Text Output Without Values


Question

I am working on a map reduce job, and I am wondering if it is possible to emit a custom string to my output file. No counts, no other quantities, just a blob of text.

Here's the basic idea of what I'm thinking about:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // this map doesn't do very much
        String line = value.toString();
        word.set(line);
        // emit to map output
        output.collect(word,one);

        // but how do I do something like output.collect(word)?
        // because in my output file I want to control the text 
        // this is intended to be a map only job
    }
}

Is this kind of thing possible? This is to create a map only job to transform data, using hadoop for its parallelism, but not necessarily the whole MR framework. When I run this job I get an output file in hdfs for each mapper.

$ hadoop fs -ls /Users/dwilliams/output
2013-09-15 09:54:23.875 java[3902:1703] Unable to load realm info from SCDynamicStore
Found 12 items
-rw-r--r--   1 dwilliams supergroup          0 2013-09-15 09:52 /Users/dwilliams/output/_SUCCESS
drwxr-xr-x   - dwilliams supergroup          0 2013-09-15 09:52 /Users/dwilliams/output/_logs
-rw-r--r--   1 dwilliams supergroup    7223469 2013-09-15 09:52 /Users/dwilliams/output/part-00000
-rw-r--r--   1 dwilliams supergroup    7225393 2013-09-15 09:52 /Users/dwilliams/output/part-00001
-rw-r--r--   1 dwilliams supergroup    7223560 2013-09-15 09:52 /Users/dwilliams/output/part-00002
-rw-r--r--   1 dwilliams supergroup    7222830 2013-09-15 09:52 /Users/dwilliams/output/part-00003
-rw-r--r--   1 dwilliams supergroup    7224602 2013-09-15 09:52 /Users/dwilliams/output/part-00004
-rw-r--r--   1 dwilliams supergroup    7225045 2013-09-15 09:52 /Users/dwilliams/output/part-00005
-rw-r--r--   1 dwilliams supergroup    7222759 2013-09-15 09:52 /Users/dwilliams/output/part-00006
-rw-r--r--   1 dwilliams supergroup    7223617 2013-09-15 09:52 /Users/dwilliams/output/part-00007
-rw-r--r--   1 dwilliams supergroup    7223181 2013-09-15 09:52 /Users/dwilliams/output/part-00008
-rw-r--r--   1 dwilliams supergroup    7223078 2013-09-15 09:52 /Users/dwilliams/output/part-00009

How do I get the results in 1 file? Should I use the identity reducer?

Answer

1. To achieve output.collect(word), you can make use of the NullWritable class. To do that, emit output.collect(word, NullWritable.get()) in your Mapper. Note that NullWritable is a singleton.
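A minimal sketch of the question's mapper rewritten with NullWritable, keeping the same old-API style as the original code (the class name Map is taken from the question; everything else follows the standard org.apache.hadoop.mapred types):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        word.set(value.toString());
        // NullWritable serializes to nothing, so TextOutputFormat writes
        // only the key text on each line, with no separator or value
        output.collect(word, NullWritable.get());
    }
}

Remember to also declare NullWritable as the job's output value class so the output format matches what the mapper emits.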

2. If you do not want multiple output files, you can set the number of reducers to 1. But this incurs additional overhead, because it involves a lot of data shuffling over the network: the single Reducer has to fetch its input from the n different machines where the Mappers were running, and all of the reduce-side load lands on just one machine. Still, if you want exactly one output file, one reducer is definitely the way to get it; conf.setNumReduceTasks(1) should be sufficient to achieve that.
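With the old API the question uses, a driver forcing a single output file might look roughly like this. This is a sketch: MyDriver, the job name, and the input/output paths are placeholders, and the identity reducer is simply Hadoop's default when no reducer class is set:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyDriver.class);   // MyDriver is a placeholder driver class
conf.setJobName("text-transform");
conf.setMapperClass(Map.class);
// no setReducerClass() call: the default identity reducer passes pairs through
conf.setNumReduceTasks(1);                    // one reducer => one part-00000 file
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(NullWritable.class);
FileInputFormat.setInputPaths(conf, new Path("/Users/dwilliams/input"));
FileOutputFormat.setOutputPath(conf, new Path("/Users/dwilliams/output"));
JobClient.runJob(conf);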

A few small suggestions:

  • I would not suggest using getmerge, as it copies the resulting file onto the local FS. As a result, you would have to copy it back to HDFS in order to use it further.
  • Use the new API if possible for you.
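For comparison, the same map-only idea in the new API (org.apache.hadoop.mapreduce) might look roughly like this; the class name PassThroughMapper is hypothetical:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PassThroughMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // emit each input line as the key, with no value
        word.set(value.toString());
        context.write(word, NullWritable.get());
    }
}

In the new API the driver would use the Job class instead of JobConf, and job.setNumReduceTasks(1) plays the same role as the old conf.setNumReduceTasks(1).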
