Map-Reduce/Hadoop sort by integer value (using MRJob)


Question

This is an MRJob implementation of a simple Map-Reduce sort. In beta.py:

from mrjob.job import MRJob

class Beta(MRJob):
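    def mapper(self, _, line):
        # Each input line is "<a> <b>"; emit the second field as the key.
        l = line.split(' ')
        yield l[1], l[0]

    def reducer(self, key, val):
        # Keep only the first value seen for each key.
        yield key, next(iter(val))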


if __name__ == '__main__':
    Beta.run()

I run it on the following input:

1 1
2 4
3 8
4 2
4 7
5 5
6 10
7 11

It can be run with:

cat <filename> | python beta.py

The issue is that the output is sorted as if the keys were strings (which is probably the case here). The output is:

"1"     "1"
"10"    "6"
"11"    "7"
"2"     "4"
"4"     "2"
"5"     "5"
"7"     "4"
"8"     "3"

The output I want is:

"1"     "1"
"2"     "4"
"4"     "2"
"5"     "5"
"7"     "4"
"8"     "3"
"10"    "6"
"11"    "7"

I am not sure whether this can be fixed by fiddling with protocols in MRJob, since protocols are job-specific and not step-specific.
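For reference, the two orderings shown above can be reproduced in plain Python without MRJob. This is only a sketch of the comparison behaviour: the `keys` list is copied by hand from the sample input's second column, and Python's `sorted` stands in for the sort that happens between the map and reduce phases.

```python
# Keys as the mapper emits them: strings taken from the second column.
keys = ["1", "4", "8", "2", "7", "5", "10", "11"]

# Strings compare character by character, so "10" sorts before "2".
string_order = sorted(keys)

# Comparing by integer value gives the numeric order instead.
numeric_order = sorted(keys, key=int)

print(string_order)
print(numeric_order)
```

The first list matches the actual output order, the second the desired one.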

EDIT (Solution): I have found the answer to this one. The idea is to prepend '0' bytes (leading zeros) to every number so that every number has the same number of bytes as the largest number. At least that's what I remember from my classes. I cannot post this as an answer right now because the site won't let me, but it is the only solution I have. If anyone has something more transparent and easier, please share.
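The padding idea can be sketched in plain Python as follows. `pad_keys` is a hypothetical helper written for this illustration, not part of MRJob, and it assumes non-negative integers. It also assumes the largest number is known up front, which a mapper normally does not know; in a real job a fixed width that is generous enough for the data would take its place.

```python
def pad_keys(numbers):
    """Zero-pad non-negative integers to the width of the largest one."""
    width = len(str(max(numbers)))
    return [str(n).zfill(width) for n in numbers]


values = [1, 4, 8, 2, 7, 5, 10, 11]
padded = pad_keys(values)

# Plain string sorting of the padded keys now follows numeric order.
restored = [int(k) for k in sorted(padded)]
print(padded)
print(restored)
```

Because all keys have equal length, character-by-character comparison and numeric comparison agree.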

Recommended answer

Simple solution (a more robust one might be based on tuning how Hadoop sorts mapper output):

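from mrjob.job import MRJob

class Beta(MRJob):

    def mapper(self, _, line):
        l = line.split()
        # Zero-pad the key to 10 digits so string order matches numeric order.
        yield '%010d' % int(l[1]), l[0]

    def reducer(self, key, values):
        # Convert back to integers; keep the first value for each key.
        yield int(key), int(next(iter(values)))


if __name__ == '__main__':
    Beta.run()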
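To check the answer's logic without installing MRJob or Hadoop, the map, sort, and reduce steps can be imitated in plain Python. This is a rough local stand-in, not how MRJob runs a job: the free functions `mapper` and `reducer` and the `lines` list are assumptions made for this sketch, and `sorted` plus `itertools.groupby` only approximate the shuffle phase for a single reducer.

```python
from itertools import groupby


def mapper(line):
    # Same key transformation as the answer: fixed-width, zero-padded string.
    left, right = line.split()
    return "%010d" % int(right), left


def reducer(key, values):
    # Convert back to integers; keep the first value seen for each key.
    return int(key), int(values[0])


lines = ["1 1", "2 4", "3 8", "4 2", "4 7", "5 5", "6 10", "7 11"]

# Stand-in for the shuffle phase: sort the pairs by their string key.
pairs = sorted((mapper(line) for line in lines), key=lambda kv: kv[0])

# Group equal keys together and hand each group to the reducer.
result = [
    reducer(key, [value for _, value in group])
    for key, group in groupby(pairs, key=lambda kv: kv[0])
]
print(result)
```

The width of 10 digits comes from the answer; it only helps for non-negative integers that fit in that width.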
