How to sort numerically in Hadoop's shuffle/sort phase?

Views: 694 | Published: 2018/5/31 18:41:35 | Tags: sorting, hadoop


Problem description


The data looks like this; the first field is a number:

    3 ...
    1 ...
    2 ...
    11 ...

I want to sort these lines by the first field numerically instead of alphabetically, so that after sorting they look like this:

    1 ...
    2 ...
    3 ...
    11 ...

But Hadoop keeps giving me this:

    1 ...
    11 ...
    2 ...
    3 ...

How do I correct it?

Solution

Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.

Add -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator to the streaming command.

Specify the type of sorting you need with mapred.text.key.comparator.options. Two useful options are -n (numeric sort) and -r (reverse sort).
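
To see why the default ordering puts 11 before 2, it may help to compare the two orderings outside Hadoop. The sketch below is only a local illustration in plain Python (no Hadoop calls; the variable names are made up for this example): keys compared as text sort character by character, while keys parsed as numbers sort by value, which is the behaviour the -n option asks for.

```python
# Keys as they arrive from the mapper: plain text, not numbers.
keys = ["3", "1", "2", "11"]

# Text comparison: "11" sorts before "2" because "1" < "2" character-wise.
lexicographic = sorted(keys)

# Numeric comparison: each key is parsed as an integer before comparing.
numeric = sorted(keys, key=int)

print(lexicographic)  # ['1', '11', '2', '3']
print(numeric)        # ['1', '2', '3', '11']
```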

Example:

Create an identity mapper and reducer with the following code.

This is mapper.py and reducer.py (both files have the same content):

    #!/usr/bin/env python
    import sys

    # Identity step: echo every input line unchanged.
    for line in sys.stdin:
        print("%s" % line.strip())

This is input.txt:

    1
    11
    2
    20
    7
    3
    40

This is the streaming command:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
        -D mapred.text.key.comparator.options=-n \
        -input /user/input.txt \
        -output /user/output.txt \
        -file ~/mapper.py -mapper ~/mapper.py \
        -file ~/reducer.py -reducer ~/reducer.py

And you will get the required output:

    1
    2
    3
    7
    11
    20
    40

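
Before submitting the job, the expected result can be checked locally. The following is a minimal sketch that imitates identity map, numeric sort on the key, and identity reduce in plain Python. It assumes the streaming convention that the key is the text before the first tab (or the whole line when there is no tab), and it parses keys with float() as a stand-in for the comparator's -n handling. The function names are hypothetical, and the sketch models a single reducer only.

```python
def split_key(line, sep="\t"):
    """Return (key, value); the whole line is the key when no separator is present."""
    key, _, value = line.partition(sep)
    return key, value


def simulate_numeric_shuffle(lines, reverse=False):
    """Imitate identity map -> numeric key sort -> identity reduce for one reducer."""
    # Identity map: strip the trailing newline and skip blank lines.
    records = [line.rstrip("\n") for line in lines if line.strip()]
    # Shuffle/sort: order records by the numeric value of their key.
    return sorted(records, key=lambda rec: float(split_key(rec)[0]), reverse=reverse)


input_lines = ["1\n", "11\n", "2\n", "20\n", "7\n", "3\n", "40\n"]
ascending = simulate_numeric_shuffle(input_lines)
descending = simulate_numeric_shuffle(input_lines, reverse=True)
print(ascending)   # ['1', '2', '3', '7', '11', '20', '40']
print(descending)  # ['40', '20', '11', '7', '3', '2', '1']
```

Passing reverse=True corresponds roughly to combining -n with -r.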
Note:

I have used simple input with a single key. If you have multiple keys and/or partitions, you will have to edit mapred.text.key.comparator.options as needed. Since I do not know your use case, my example is limited to this.

The identity mapper is needed because an MR job needs at least one mapper to run.

The identity reducer is needed because a pure map-only job skips the shuffle/sort phase, so no sorting would take place.
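
For the multi-key case mentioned in the note, the comparator options follow the style of the Unix sort -k flag. As a hedged illustration, an option such as -k2,2nr would ask for the second key field to be compared numerically in reverse. The sketch below only emulates that ordering locally in Python. The sample keys, the "." separator, and the helper name are assumptions made for this example and are not part of the original answer. A real job would also need the key separator and the number of key fields configured, so check the Hadoop Streaming documentation for the exact property names.

```python
def sort_by_field(keys, field, sep=".", numeric=True, reverse=False):
    """Order keys by one 1-based field, roughly like '-k<field>,<field>' with n and/or r."""
    def field_value(key):
        part = key.split(sep)[field - 1]
        return float(part) if numeric else part
    return sorted(keys, key=field_value, reverse=reverse)


# Hypothetical composite keys with four dot-separated fields.
sample_keys = ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]

# Emulate '-k2,2nr': second field, numeric, descending.
ordered = sort_by_field(sample_keys, field=2, reverse=True)
second_fields = [key.split(".")[1] for key in ordered]
print(second_fields)  # ['14', '14', '12', '12', '11']
```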
