hadoop converting \r\n to \n and breaking ARC format

Problem Description

I am trying to parse data from commoncrawl.org using Hadoop Streaming. I set up a local Hadoop installation to test my code, and have a simple Ruby mapper which uses a streaming ARCfile reader. When I invoke my code myself, like

  cat 1262876244253_18.arc.gz | mapper.rb | reducer.rb

It works as expected.

It seems that Hadoop automatically sees that the file has a .gz extension and decompresses it before handing it to the mapper. However, while doing so it converts the \r\n line breaks in the stream to \n. Since ARC relies on a record length given in the header line, this change breaks the parser (because the data length no longer matches).
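
To make the failure mode concrete, here is a minimal Ruby sketch (with a made-up record; per the ARC v1 format, the last field of the header line is the record length in bytes):

  # Made-up example record: the header declares the payload length in bytes.
  header  = "http://example.com/ 1.2.3.4 20100101000000 text/html 46"
  payload = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\nhi"

  declared = header.split.last.to_i        # => 46
  payload.bytesize == declared             # => true
  # Rewriting \r\n as \n shrinks the payload, so length-based parsing
  # now reads past the record boundary:
  payload.gsub("\r\n", "\n").bytesize      # => 43, but the header still says 46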

To double-check, I changed my mapper to expect uncompressed data, and did:

  cat 1262876244253_18.arc.gz | zcat | mapper.rb | reducer.rb

And it works.

I don't mind Hadoop decompressing automatically (although I can quite happily deal with streaming .gz files myself), but if it does, I need it to decompress in 'binary' mode, without doing any line-break conversion or similar. I believe the default behaviour is to feed each decompressed file to a single mapper, which is perfect.
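
As an illustration of the 'deal with streaming .gz files myself' variant, here is a sketch that decompresses inside the mapper, byte for byte, assuming the raw gzip bytes actually reached stdin unmodified (which, as described above, is not what happens once Hadoop decompresses for you):

  # Hypothetical sketch: decompress on the mapper side so nothing upstream
  # can rewrite line endings.
  require 'zlib'

  gz = Zlib::GzipReader.new($stdin)
  while (chunk = gz.read(64 * 1024))
    # hand each chunk to the ARC reader; \r\n survives a binary read
    # (a real ARC reader must also handle concatenated gzip members)
  end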

How can I either ask it not to decompress .gz files (renaming the files is not an option) or make it decompress properly? I would prefer not to use a special InputFormat class that I have to ship in a jar, if at all possible.

All of this will eventually run on AWS ElasticMapReduce.

Solution

It looks like Hadoop's PipeMapper.java is to blame (at least in 0.20.2):

Around line 106, the input from TextInputFormat is passed to this mapper (at which stage the \r\n has already been stripped), and PipeMapper writes it out to stdout terminated with just a \n.
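
In Ruby terms, the round trip behaves roughly like this (illustrative only; the actual code is Java):

  # What the streaming layer effectively does to each record:
  line     = "HTTP/1.1 200 OK\r\n"
  stripped = line.sub(/\r?\n\z/, "")  # TextInputFormat drops the terminator
  piped    = stripped + "\n"          # PipeMapper re-adds a bare \n
  piped == line                       # => false: the \r is gone for good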

A suggestion would be to amend the source of PipeMapper.java, check that this 'feature' still exists, and change it as required (perhaps allowing the line terminator to be set via a configuration property).
