How to sort numerically in Hadoop's shuffle/sort phase?


Problem description

The data looks like this; the first field is a number:

3 ...
1 ...
2 ...
11 ...

I want to sort these lines by the first field numerically instead of alphabetically, so that after sorting they look like this:

1 ...
2 ...
3 ...
11 ...

But Hadoop keeps giving me this:

1 ...
11 ...
2 ...
3 ...

How do I correct it?

Solution

Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.

  1. Add -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator to the streaming command.

  2. Specify the type of sorting you need with mapred.text.key.comparator.options. Some useful options are -n (numeric sort) and -r (reverse sort).
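To see why the default ordering puts 11 before 2, the following is a minimal local sketch in plain Python (no Hadoop involved). It only approximates the comparator's behavior: the helper name emulate_sort and its parameters are hypothetical, with numeric=True standing in for -n and reverse=True standing in for -r.

```python
def emulate_sort(keys, numeric=False, reverse=False):
    """Sort string keys roughly the way the shuffle would order them.

    numeric=False approximates the default text comparison.
    numeric=True approximates -n; reverse=True approximates -r.
    (Hypothetical helper for illustration, not part of Hadoop.)
    """
    key_func = float if numeric else None
    return sorted(keys, key=key_func, reverse=reverse)


keys = ["3", "1", "2", "11"]
print(emulate_sort(keys))                              # ['1', '11', '2', '3']
print(emulate_sort(keys, numeric=True))                # ['1', '2', '3', '11']
print(emulate_sort(keys, numeric=True, reverse=True))  # ['11', '3', '2', '1']
```

The first result matches the unwanted output from the question, because the keys are compared as text, character by character. The second matches the desired order.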

EXAMPLE:

Create an identity mapper and reducer with the following code.

This is mapper.py and reducer.py (both files contain the same code):

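#!/usr/bin/env python
# Identity mapper/reducer for Hadoop Streaming: echo every input line unchanged.
import sys

for line in sys.stdin:
    # The parenthesized form of print works on both Python 2 and Python 3.
    print("%s" % line.strip())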

This is input.txt:

1
11
2
20
7
3
40

This is the Streaming command:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-n \
    -input /user/input.txt \
    -output /user/output.txt \
    -file ~/mapper.py \
    -mapper ~/mapper.py \
    -file ~/reducer.py \
    -reducer ~/reducer.py

You will then get the required output:

1
2
3
7
11
20
40

NOTE:

  1. I have used a simple single-key input. If you have multiple key fields and/or partitions, you will have to adjust mapred.text.key.comparator.options as needed. Since I do not know your use case, my example is limited to this.

  2. The identity mapper is needed because an MR job needs at least one mapper to run.

  3. The identity reducer is needed because the shuffle/sort phase does not run in a map-only job.
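As a rough illustration of note 1, here is a local Python sketch of a two-field ordering: first field ascending, second field descending, both compared numerically. The sample records and the function name sort_by_fields are hypothetical. In a real job this ordering would be requested through the comparator options (something along the lines of -k1,1n -k2,2nr, in the style of Unix sort, together with telling Streaming how many fields make up the key); treat that option string as an assumption to check against the Streaming documentation for your Hadoop version.

```python
def sort_by_fields(lines, sep="\t"):
    """Order lines by field 1 ascending, then field 2 descending (numeric).

    Hypothetical local stand-in for a multi-field comparator setting.
    """
    def sort_key(line):
        fields = line.split(sep)
        # Negating the second field gives descending order within ties.
        return (int(fields[0]), -int(fields[1]))

    return sorted(lines, key=sort_key)


# Hypothetical tab-separated records: two numeric fields per line.
records = ["2\t5", "11\t1", "2\t40", "1\t7"]
for line in sort_by_fields(records):
    print(line)
```

Running this locally on a small sample is a cheap way to confirm the ordering you want before encoding it in the job's comparator options.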
