Hadoop streaming - remove trailing tab from reducer output

Problem description

I have a hadoop streaming job whose output does not contain key/value pairs. You can think of it as value-only pairs or key-only pairs.

My streaming reducer (a PHP script) outputs records separated by newlines. Hadoop streaming treats each record as a key with no value, and inserts a tab before the newline. This extra tab is unwanted.

How do I remove it?

I am using Hadoop 1.0.3 on AWS EMR. I downloaded the Hadoop 1.0.3 source and found this code in hadoop-1.0.3/src/contrib/streaming/src/java/org/apache/hadoop/streaming/PipeReducer.java:

reduceOutFieldSeparator = job_.get("stream.reduce.output.field.separator", "\t").getBytes("UTF-8");

So I tried passing -D stream.reduce.output.field.separator= as an argument to the job with no luck. I also tried -D mapred.textoutputformat.separator= and -D mapreduce.output.textoutputformat.separator= with no luck.

I've searched Google, of course, and nothing I found worked. One search result even stated that no argument could be passed to achieve the desired result (though the Hadoop version in that case was really, really old).

Here is my code (with added line breaks for readability):

hadoop jar streaming.jar -files s3n://path/to/a/file.json#file.json \
    -D mapred.output.compress=true -D stream.reduce.output.field.separator= \
    -input s3n://path/to/some/input/*/* -output hdfs:///path/to/output/dir \
    -mapper 'php my_mapper.php' -reducer 'php my_reducer.php'

Solution

Looking at the org.apache.hadoop.mapreduce.lib.output.TextOutputFormat source, I see 2 things:

  1. The write(key, value) method writes the separator whenever both the key and the value are non-null, and an empty value still counts as non-null.
  2. The separator is always set: the default (\t) is used when the lookup of mapred.textoutputformat.separator returns null (which I'm assuming happens with -D stream.reduce.output.field.separator=).

Your only solution may be to write your own OutputFormat that works around these two issues.
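
To make the two points above concrete, here is a minimal plain-Java model of that write logic. It is not Hadoop's code: the class and method names (TextOutputModel, resolveSeparator, formatRecord, formatRecordSkippingEmptyValue) are hypothetical, and it assumes that streaming hands a key-only line to the output format as a key plus an empty, non-null value. The last method sketches what a custom OutputFormat might do differently.

```java
// Plain-Java model of the behaviour described above. NOT Hadoop source code;
// all names here are hypothetical and exist only for illustration.
class TextOutputModel {
    /** Default separator used when no separator is configured. */
    static final String DEFAULT_SEPARATOR = "\t";

    /** Models the lookup: a null (missing) setting falls back to the tab. */
    static String resolveSeparator(String configured) {
        return configured == null ? DEFAULT_SEPARATOR : configured;
    }

    /** Models write(key, value): the separator is emitted when both parts are non-null. */
    static String formatRecord(String key, String value, String configuredSeparator) {
        if (key == null && value == null) {
            return ""; // nothing is written for a completely empty record
        }
        StringBuilder out = new StringBuilder();
        if (key != null) {
            out.append(key);
        }
        if (key != null && value != null) {
            out.append(resolveSeparator(configuredSeparator));
        }
        if (value != null) {
            out.append(value);
        }
        return out.append('\n').toString();
    }

    /** Hypothetical workaround: treat an empty value as absent, so no separator is written. */
    static String formatRecordSkippingEmptyValue(String key, String value, String configuredSeparator) {
        String effectiveValue = (value == null || value.isEmpty()) ? null : value;
        return formatRecord(key, effectiveValue, configuredSeparator);
    }

    public static void main(String[] args) {
        // Assumption: a key-only reducer line arrives as key="record", value="".
        String withTab = formatRecord("record", "", null);
        String withoutTab = formatRecordSkippingEmptyValue("record", "", null);
        // Make the tab visible when printing.
        System.out.print(withTab.replace("\t", "<TAB>"));
        System.out.print(withoutTab.replace("\t", "<TAB>"));
    }
}
```

In a real job the same idea would live in the record writer of a custom OutputFormat passed to the streaming job; the model only shows why an empty value still produces the separator.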

My testing

In a task I had, I wanted to reformat a line from

id1|val1|val2|val3
id2|val1

into:

id1|val1,val2,val3
id2|val1

I had a custom mapper (a Perl script) to convert the lines. For this task, I initially tried treating the input as key-only (or value-only), but the results came out with the trailing tab.
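
For reference, the conversion itself can be sketched as follows. This is only an illustration in Java, not the Perl script mentioned above; the class name LineReformatter and the method reformat are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Illustrative stand-in for the mapper's line conversion (hypothetical names).
class LineReformatter {
    /** Keeps the first '|' and turns the remaining '|' separators into commas. */
    static String reformat(String line) {
        int firstBar = line.indexOf('|');
        if (firstBar < 0) {
            return line; // no separator at all: leave the line unchanged
        }
        String id = line.substring(0, firstBar);
        String values = line.substring(firstBar + 1).replace('|', ',');
        return id + "|" + values;
    }

    public static void main(String[] args) throws IOException {
        // A streaming mapper reads records from stdin and writes them to stdout.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(reformat(line));
        }
    }
}
```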

At first I just specified:

-D stream.map.input.field.separator='|' -D stream.map.output.field.separator='|'

This gave the mapper a key/value pair, since my mapping wanted a key anyway. But the output now had a tab after the first field.

I got the desired output when I added:

-D mapred.textoutputformat.separator='|'

If I didn't set it, or set it to blank:

-D mapred.textoutputformat.separator=

then I would again get a tab after the first field.

It made sense once I looked at the source for TextOutputFormat.
