hadoop converting \r\n to \n and breaking ARC format


Problem Description

I am trying to parse data from commoncrawl.org using hadoop streaming. I set up a local hadoop to test my code, and have a simple Ruby mapper which uses a streaming ARCfile reader. When I invoke my code myself like

cat 1262876244253_18.arc.gz | mapper.rb | reducer.rb

it works as expected.

It seems that hadoop automatically sees that the file has a .gz extension and decompresses it before handing it to a mapper - however while doing so it converts \r\n linebreaks in the stream to \n. Since ARC relies on a record length in the header line, the change breaks the parser (because the data length has changed).
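To make the failure mode concrete, here is a minimal sketch (not the actual commoncrawl reader) of how an ARC reader relies on the declared record length; the header field layout is assumed from the ARC v1 spec:

import java.io.DataInputStream;
import java.io.IOException;

// Minimal illustration, not the real ARC reader: each ARC record header line
// ends with a byte count, and the reader consumes exactly that many bytes to
// reach the next record header.
class ArcRecordSkipper {
    static void skipRecord(DataInputStream in, String headerLine) throws IOException {
        String[] fields = headerLine.split(" ");
        int declaredLength = Integer.parseInt(fields[fields.length - 1]);

        // If \r\n pairs inside the payload have already been rewritten to \n,
        // the payload is now shorter than declaredLength, so this read runs
        // past the start of the next record and the parser loses alignment.
        byte[] payload = new byte[declaredLength];
        in.readFully(payload);
    }
}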

To double check, I changed my mapper to expect uncompressed data, and did:

cat 1262876244253_18.arc.gz | zcat | mapper.rb | reducer.rb

And it works.

I don't mind hadoop automatically decompressing (although I can quite happily deal with streaming .gz files), but if it does I need it to decompress in 'binary' without doing any linebreak conversion or similar. I believe that the default behaviour is to feed decompressed files to one mapper per file, which is perfect.

How can I either ask it not to decompress .gz (renaming the files is not an option) or make it decompress properly? I would prefer not to use a special InputFormat class which I have to ship in a jar, if at all possible.

All of this will eventually run on AWS ElasticMapReduce.

Recommended Answer

Looks like the Hadoop PipeMapper.java is to blame (at least in 0.20.2):

Around line 106, the input from TextInputFormat is passed to this mapper (at which stage the \r\n has been stripped), and the PipeMapper is writing it out to stdout with just a \n.
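For reference, this is roughly the shape of the code the answer is pointing at - a paraphrase of the 0.20.x streaming PipeMapper, not an exact copy; write() and clientOut_ come from its PipeMapRed superclass:

// Paraphrased sketch of PipeMapper.map() in Hadoop streaming 0.20.x.
// By the time this runs, TextInputFormat's LineRecordReader has already
// stripped the trailing \r\n from each input line, and only a bare \n
// is written to the streaming mapper process.
public void map(Object key, Object value,
                OutputCollector output, Reporter reporter) throws IOException {
    write(key);               // key bytes, original line ending already gone
    clientOut_.write('\t');
    write(value);
    clientOut_.write('\n');   // only \n is re-added; the \r never comes back
}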

A suggestion would be to amend the source for your PipeMapper.java, check this 'feature' still exists, and amend as required (maybe allow it to be set via a configuration property).
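As an illustration only, the amendment could be gated behind a job property; the name stream.map.input.keep.crlf below is invented for this sketch and is not an existing Hadoop setting:

// Hypothetical patch sketch: let a job property decide whether PipeMapper
// restores \r\n when piping each record to the streaming mapper.
// "stream.map.input.keep.crlf" is an invented key, not a real Hadoop option.
boolean keepCrlf = job.getBoolean("stream.map.input.keep.crlf", false);

write(key);
clientOut_.write('\t');
write(value);
if (keepCrlf) {
    clientOut_.write('\r');   // put back the \r that the record reader stripped
}
clientOut_.write('\n');

Note that the record reader strips \n and \r\n endings the same way, so a complete fix would also have to track which terminator each line originally used rather than blindly re-adding \r.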
