Hadoop sort input order


Problem description



If the input to my job is the fileset [a, b, c, d], is the input to the sort strictly [map(a.0), map(a.1), map(b.0), map(b.1), map(c.0), map(c.1), map(d.0), map(d.1)]?

My motivation is having a series of files (which will of course be broken up into blocks) whose rows are [key, value], where each of key and value is a simple string. I wish to concatenate these values together in the reducer, per key, in the order they appear in the input, despite there not being an explicit order-defining field.

Any advice much appreciated; this is proving to be a difficult query to Google for.

Example

Input format

A First
A Another
A Third
B First
C First
C Another

Desired output

A First,Another,Third
B First
C First,Another

To reiterate, I'm uncertain if I can rely on getting First-Third in the correct order given files are being stored in separate blocks.

Solution

One solution to this issue is to make use of the TextInputFormat's byte offset in the file as part of a composite key, and use a secondary sort to make sure the values are sent to the reducer in order. That way you can make sure the reducer sees input partitioned by the key you want, in the order it came in the file. If you have multiple input files, this approach will not work, as each new file will reset the byte counter.

With the streaming API you'll need to pass -inputformat TextInputFormat -D stream.map.input.ignoreKey=false to the job so that you actually get the byte offsets as the key (by default the PipeMapper won't give you keys if the input format is TextInputFormat, even when you set that flag explicitly, so you need the additional ignoreKey flag).
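With those flags in place, a streaming mapper along these lines would turn each offset-keyed input line into a composite-key record (this is a sketch of the composite-key idea, not code from the original answer; the function names and field layout are illustrative):

```python
import sys

def map_line(offset, line):
    """Build a composite-key record (key TAB offset TAB value) from one line.

    `line` looks like "A First"; `offset` is the byte offset that
    TextInputFormat supplies as the key when stream.map.input.ignoreKey=false.
    Emitting the offset as a second key field lets the job partition on
    field 1 and secondary-sort on fields 1-2.
    """
    key, value = line.split(" ", 1)
    return "%s\t%s\t%s" % (key, offset, value)

def run(stdin=sys.stdin, stdout=sys.stdout):
    # The PipeMapper delivers "byteoffset TAB line" on stdin.
    for raw in stdin:
        offset, line = raw.rstrip("\n").split("\t", 1)
        stdout.write(map_line(offset, line) + "\n")

if __name__ == "__main__":
    run()
```

On the cluster, the shuffle then sorts these records by the flags described next; the mapper itself only has to emit the offset as the middle field.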

If you're emitting multiple keys from a mapper, be sure to set the following flags so your output is partitioned on the first key and sorted on the first and second in the reducer:

-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-D stream.num.map.output.key.fields=2
-D mapred.text.key.partitioner.options="-k1,1"
-D mapred.output.key.comparator.class="org.apache.hadoop.mapred.lib.KeyFieldBasedComparator"
-D mapreduce.partition.keycomparator.options="-k1 -k2n"
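On the reduce side, a matching streaming reducer only has to concatenate values per key; this sketch (again illustrative, assuming records arrive sorted by key and numeric offset as the flags above arrange) reproduces the desired output from the example:

```python
import sys
from itertools import groupby

def reduce_records(lines):
    """Join the values of consecutive records that share a key.

    Each line is "key TAB offset TAB value", already sorted by
    (key, offset) by the shuffle. The offset has done its job and is
    dropped; each output line is "key TAB value1,value2,...".
    """
    records = (line.rstrip("\n").split("\t") for line in lines)
    return [
        "%s\t%s" % (key, ",".join(rec[2] for rec in group))
        for key, group in groupby(records, key=lambda rec: rec[0])
    ]

def run(stdin=sys.stdin, stdout=sys.stdout):
    for out in reduce_records(stdin):
        stdout.write(out + "\n")

if __name__ == "__main__":
    run()
```

Because groupby only merges consecutive records, this relies entirely on the partitioner and comparator settings above to deliver each key's records together and in file order.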

