MapReduce: How can I output key/value pairs without newlines?


Problem description

I am running a map-only job (zero reducers). I want to preprocess data from one file and write it out to another file, but without the extra newlines and tab delimiters. How can I make my map job output the processed data in the same format it came in, minus the parts removed by preprocessing? That is, I have something like this:

Before preprocessing:

<TITLE> Herp derp </Title> I am a major general  

After processing (current output):

Herp 
Derp 
I 
am 
a
major
general

What I want it to do is this:

Herp Derp I am a major general 

I believe the issue is with this line of code:

job.setOutputFormatClass(TextOutputFormat.class);

However, I naively tried something like:

job.setOutputFormatClass(null);

It obviously did not work. Is there a provided format class that I can use for this? If not, how could I write my own class that outputs everything exactly as I want? I am new to Hadoop and MapReduce.

I have included my map function below. I do not want to use a reducer, because the data would be sorted between the map and reduce phases.

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while (tokenizer.hasMoreTokens()) {

            // word is a Text field of the mapper class (declaration not shown)
            word.set(tokenizer.nextToken());

            // Did preprocessing here, irrelevant to my problem

            // One write per token, so each token becomes its own output record
            context.write(word, null);
        }
    }

I have also searched online and read the Apache Hadoop API documentation to see if I could glean an answer.

Solution

In your mapper, instead of splitting the line into individual words and writing each one out separately, write the whole line with a single call to

context.write(word, null);

Each call to context.write produces one output record, and TextOutputFormat ends every record with a newline, so writing one token per call puts one token on each line. Writing the line once keeps the string you started with together, instead of sending it out piece by piece.

So split the string apart for the preprocessing work, then put it back together before you send it out with context.write.
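The split-then-rejoin idea can be sketched in plain Java, without the Hadoop dependency, so it can be run on its own. This is only an illustration under assumptions: the class name LineRejoiner and the helpers processLine and keepToken are hypothetical, and keepToken (which drops tokens that look like tags) merely stands in for the asker's real preprocessing, which the question does not show.

```java
import java.util.StringJoiner;
import java.util.StringTokenizer;

// Plain-Java sketch of the per-line logic a mapper could use.
// All names here are hypothetical; they do not come from the original post.
class LineRejoiner {

    // Assumed stand-in for the preprocessing step: keep a token unless it
    // looks like a tag such as <TITLE> or </Title>.
    static boolean keepToken(String token) {
        return !(token.startsWith("<") && token.endsWith(">"));
    }

    // Splits the line on whitespace, filters the tokens, and joins the
    // survivors back into a single space-separated string.
    static String processLine(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        StringJoiner joined = new StringJoiner(" ");
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (keepToken(token)) {
                joined.add(token);
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(processLine("<TITLE> Herp derp </Title> I am a major general"));
    }
}
```

Inside the real map method, the joined string would then be written once per input line, after the loop rather than inside it, for example with context.write(new Text(joined), NullWritable.get()). TextOutputFormat writes only the key when the value is a NullWritable, so no tab is emitted, and each input line still ends up as exactly one output line.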
