How to sort numerically in Hadoop's shuffle/sort phase?
The data looks like this; the first field is a number:
3 ...
1 ...
2 ...
11 ...
I want to sort these lines by the first field numerically instead of alphabetically, so that after sorting they look like this:
1 ...
2 ...
3 ...
11 ...
But Hadoop keeps giving me this:
1 ...
11 ...
2 ...
3 ...
How do I correct it?
Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.
Add -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator to the streaming command.
Specify the type of sorting you need with mapred.text.key.comparator.options. Two useful options are -n (numeric sort) and -r (reverse sort).
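To see why the default order puts 11 before 2, here is a minimal local sketch in plain Python. It does not touch Hadoop at all; it only mimics comparing keys as text versus as numbers, using the sample keys from the question. The variable names are made up for illustration.

```python
# Local illustration only: plain Python, no Hadoop involved.
keys = ["3", "1", "2", "11"]

# Default behaviour: keys are compared as text, so "11" sorts before "2".
text_order = sorted(keys)

# Numeric comparison (what the -n option asks for): compare by integer value.
numeric_order = sorted(keys, key=int)

# Numeric and reversed (the -n and -r options combined).
reverse_numeric_order = sorted(keys, key=int, reverse=True)

print(text_order)             # ['1', '11', '2', '3']
print(numeric_order)          # ['1', '2', '3', '11']
print(reverse_numeric_order)  # ['11', '3', '2', '1']
```

The first line of output matches what Hadoop produces by default, and the second is the order the question asks for.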
EXAMPLE:
Create an identity mapper and an identity reducer with the following code.
This is mapper.py and reducer.py (both files have the same content):
#!/usr/bin/env python
import sys

# Identity mapper/reducer: copy each input line to stdout unchanged.
for line in sys.stdin:
    print(line.strip())
This is input.txt:
1
11
2
20
7
3
40
This is the Streaming command:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-n \
    -input /user/input.txt \
    -output /user/output.txt \
    -file ~/mapper.py \
    -mapper ~/mapper.py \
    -file ~/reducer.py \
    -reducer ~/reducer.py
This gives the required output:
1
2
3
7
11
20
40
NOTE:
I have used a simple input with a single key. If you have multiple keys and/or partitions, you will have to edit mapred.text.key.comparator.options as needed. Since I do not know your use case, my example is limited to this.
An identity mapper is needed because an MR job needs at least one mapper to run.
An identity reducer is needed because the shuffle/sort phase does not run in a pure map-only job.
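For keys with more than one field, the Hadoop Streaming documentation describes comparator options of the form -k pos1[,pos2] with n and r modifiers, for example -k2,2nr to sort on the second key field numerically in reverse. The sketch below only mimics that ordering locally in plain Python so you can check what to expect. The sample records, the tab separator, and the helper name are assumptions for illustration, not part of the original example.

```python
# Local illustration only: mimics the ordering described for -k2,2nr.
# Hypothetical records with two tab-separated key fields per line.
records = ["a\t3", "b\t11", "c\t2", "d\t20"]

def second_field_as_int(record):
    """Return the second tab-separated field as an integer sort key."""
    return int(record.split("\t")[1])

# Sort on the second field, numerically (n) and in reverse (r).
sorted_records = sorted(records, key=second_field_as_int, reverse=True)

for record in sorted_records:
    print(record)
```

With these records the second field comes out in the order 20, 11, 3, 2. Whether the same options behave this way in your job depends on how the key fields and separators are configured, so treat this as a way to reason about the expected order rather than a guarantee.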