How to use "typedbytes" or "rawbytes" in Hadoop Streaming?


Question

I have a problem that would be solved by Hadoop Streaming in "typedbytes" or "rawbytes" mode, which allow one to analyze binary data in a language other than Java. (Without this, Streaming interprets some characters, usually \t and \n, as delimiters and complains about non-utf-8 characters. Converting all my binary data to Base64 would slow down the workflow, defeating the purpose.)

These binary modes were added by HADOOP-1722. On the command line that invokes a Hadoop Streaming job, "-io rawbytes" lets you define your data as a 32-bit integer size followed by raw data of that size, and "-io typedbytes" lets you define your data as a one-byte zero (the type code for raw bytes), followed by a 32-bit integer size, followed by raw data of that size. I have created files with these formats (with one or many records) and verified that they are in the right format by checking them against the output of typedbytes.py. I've also tried all conceivable variations (big-endian, little-endian, different byte offsets, etc.). I'm using Hadoop 0.20 from CDH4, which has the classes that implement the typedbytes handling, and it is entering those classes when the "-io" switch is set.
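
For concreteness, here is a minimal Python sketch (my own, not from the question) of writing one record in each of the two layouts just described, using big-endian byte order as Java's DataOutput produces. The file name local/typedbytes.tb matches the command used in the answer below; the helper names are illustrative only.

import struct

def write_rawbytes(out, data):
    # rawbytes record: 4-byte big-endian length, then the raw payload
    out.write(struct.pack(">i", len(data)))
    out.write(data)

def write_typedbytes_raw(out, data):
    # typedbytes record: one type byte of 0 (raw bytes), then 4-byte length, then payload
    out.write(struct.pack(">b", 0))
    out.write(struct.pack(">i", len(data)))
    out.write(data)

if __name__ == "__main__":
    with open("local/typedbytes.tb", "wb") as out:
        # typedbytes records are written back to back: key, value, key, value, ...
        write_typedbytes_raw(out, b"key1")
        write_typedbytes_raw(out, b"\x00\x01\x02 some arbitrary binary value")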

I copied the binary file to HDFS with "hadoop fs -copyFromLocal". When I try to use it as input to a map-reduce job, it fails with an OutOfMemoryError on the line where it tries to make a byte array of the length I specify (e.g. 3 bytes). It must be reading the number incorrectly and trying to allocate a huge block instead. Despite this, it does manage to get a record to the mapper (the previous record? not sure), which writes it to standard error so that I can see it. There are always too many bytes at the beginning of the record: for instance, if the file is "\x00\x00\x00\x00\x03hey", the mapper would see "\x04\x00\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\x00\x03hey" (reproducible bits, though no pattern that I can see).

From page 5 of this talk, I learned that there are "loadtb" and "dumptb" subcommands of streaming, which copy to/from HDFS and wrap/unwrap the typed bytes in a SequenceFile, in one step. When used with "-inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat", Hadoop correctly unpacks the SequenceFile, but then misinterprets the typedbytes contained within, in exactly the same way.

Moreover, I can find no documentation of this feature. On Feb 7 (I e-mailed it to myself), it was briefly mentioned in the streaming.html page on Apache (http://hadoop.apache.org/docs/mapreduce/r0.21.0/streaming.html#Specifying+the+Communication+Format), but that r0.21.0 webpage has since been taken down and the equivalent page for r1.1.1 makes no mention of rawbytes or typedbytes.

So my question is: what is the correct way to use rawbytes or typedbytes in Hadoop Streaming? Has anyone ever gotten it to work? If so, could someone post a recipe? It seems like this would be a problem for anyone who wants to use binary data in Hadoop Streaming, which ought to be a fairly broad group.

P.S. I noticed that Dumbo, Hadoopy, and rmr all use this feature, but there ought to be a way to use it directly, without being mediated by a Python-based or R-based framework.

Answer

Okay, I've found a combination that works, but it's weird.


  1. Prepare a valid typedbytes file in your local filesystem, following the documentation or by imitating typedbytes.py.

  2. Use

hadoop jar path/to/streaming.jar loadtb path/on/HDFS.sequencefile < local/typedbytes.tb

to wrap the typedbytes in a SequenceFile and put it in HDFS, in one step.

  3. Use

hadoop jar path/to/streaming.jar -inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat ...

to run a map-reduce job in which the mapper gets input from the SequenceFile. Note that -io typedbytes or -D stream.map.input=typedbytes should not be used; explicitly asking for typedbytes leads to the misinterpretation I described in my question. But fear not: Hadoop Streaming splits the input on its binary record boundaries and not on its '\n' characters. The data arrive in the mapper as "rawdata" separated by '\t' and '\n', like this (a parsing sketch in Python follows the list):


  1. 32-bit signed integer, representing a length (note: no type character)

  2. block of raw binary of that length: this is the key

  3. '\t' (tab character... why?)

  4. 32-bit signed integer, representing a length

  5. block of raw binary of that length: this is the value

  6. '\n' (newline character... why?)
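
Here is a minimal Python sketch of a streaming mapper that consumes this layout. It follows the observations above, so treat the exact details (big-endian lengths, the stray '\t' and '\n') as assumptions to verify against your own data; the helper name read_record is illustrative only.

import struct
import sys

def read_record(stream):
    # read one key-value pair in the layout listed above
    header = stream.read(4)
    if len(header) < 4:
        return None                                 # end of input
    key_len = struct.unpack(">i", header)[0]        # big-endian 32-bit length
    key = stream.read(key_len)
    stream.read(1)                                  # skip the '\t' separator
    value_len = struct.unpack(">i", stream.read(4))[0]
    value = stream.read(value_len)
    stream.read(1)                                  # skip the trailing '\n'
    return key, value

if __name__ == "__main__":
    stdin = sys.stdin.buffer                        # raw bytes, not decoded text
    while True:
        record = read_record(stdin)
        if record is None:
            break
        key, value = record
        # real work goes here; this just reports the sizes to stderr
        sys.stderr.write("key=%d bytes, value=%d bytes\n" % (len(key), len(value)))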

  4. If you want to additionally send raw data from mapper to reducer, add

    -D stream.map.output=typedbytes -D stream.reduce.input=typedbytes
    

    to your Hadoop command line, and format the mapper's output and the reducer's expected input as valid typedbytes. The data still alternate as key-value pairs, but this time with type characters and without '\t' and '\n'. Hadoop Streaming correctly splits these pairs on their binary record boundaries and groups them by key. (A small sketch of emitting typedbytes output appears at the end of this answer.)

    The only documentation on stream.map.output and stream.reduce.input that I could find was in the HADOOP-1722 exchange, starting 6 Feb 09. (Earlier discussion considered a different way to parameterize the formats.)

    This recipe does not provide strong typing for the input: the type characters are lost somewhere in the process of creating a SequenceFile and interpreting it with the -inputformat. It does, however, provide splitting at the binary record boundaries rather than at '\n', which is the really important thing, and it provides strong typing between the mapper and the reducer.
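
    As an illustration of that mapper-to-reducer format, here is a minimal Python sketch (my own, not from the original answer) that emits key-value pairs as typedbytes: each key and value is a type byte of 0 (raw bytes), a big-endian 4-byte length, and the payload, with no '\t' or '\n' between them. The pair contents are made up for illustration; a real mapper would derive them from its input records.

    import struct
    import sys

    def emit_typedbytes_pair(out, key, value):
        # each of key and value is a complete typedbytes object: type byte 0
        # (raw bytes), 4-byte big-endian length, then the payload; pairs are
        # written back to back with no '\t' or '\n' between them
        for blob in (key, value):
            out.write(struct.pack(">b", 0))
            out.write(struct.pack(">i", len(blob)))
            out.write(blob)

    if __name__ == "__main__":
        stdout = sys.stdout.buffer
        emit_typedbytes_pair(stdout, b"some-binary-key", b"\xde\xad\xbe\xef")
        stdout.flush()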
