How does MapReduce sort and shuffle work?


Question


I am using Yelp's MRJob library to implement map-reduce functionality. I know that MapReduce has an internal sort and shuffle step, which sorts the values on the basis of their keys. So if I have the following results after the map phase:

(1, 24) (4, 25) (3, 26)

I know the sort and shuffle phase will produce the following output:

(1, 24) (3, 26) (4, 25)

This is as expected.

But if I have two identical keys with different values, why does the sort and shuffle phase also sort the data on the basis of the first value that appears?

For example, if I have the following list of values from the mapper:

(2, <25, 26>) (1, <24, 23>) (1, <23, 24>) 

The expected output is

(1, <24, 23>) (1, <23, 24>) (2, <25, 26>)

But the output that I am getting is

(1, <23, 24>) (1, <24, 23>) (2, <25, 26>)

Is this specific to the MRJob library? Is there any way to stop this sorting on the basis of values?

CODE

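from mrjob.job import MRJob
import math


class SortMR(MRJob):

    def steps(self):
        # self.mr() is the older mrjob step API used in the original post;
        # later mrjob releases build steps with mrjob.step.MRStep instead.
        return [
            self.mr(mapper=self.rangemr,
                    reducer=self.rangesort)]

    def rangemr(self, key, line):
        # Emit every whitespace-separated token under the same key (1).
        for a in line.split():
            yield 1, a

    def rangesort(self, numid, line):
        # 'line' is the iterator of values grouped under the key 'numid'.
        for a in line:
            yield (1, a)


if __name__ == '__main__':
    SortMR.run()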

Solution

When run locally, MRJob just uses the operating system's 'sort' on the mapper output.

The mapper writes out in the format:

key<-tab->value\n

Because 'sort' compares whole lines, you end up with the records sorted primarily by key, but secondarily by value.

As noted, this doesn't happen in the real Hadoop version, only in the 'local' simulation.
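
The effect described in this answer can be reproduced without MRJob at all. The following is only a rough sketch, not MRJob's actual implementation: it assumes each mapper record is written as a JSON-encoded key, a tab, and a JSON-encoded value, and it uses Python's built-in `sorted` as a stand-in for the operating system's `sort`. The helper names `to_line` and `from_line` are made up for this illustration.

```python
import json

# Mapper output from the question, in the order it was emitted.
pairs = [(2, [25, 26]), (1, [24, 23]), (1, [23, 24])]


def to_line(key, value):
    # Hypothetical helper: serialize one record as "key<TAB>value".
    return json.dumps(key) + "\t" + json.dumps(value)


def from_line(line):
    # Hypothetical helper: parse a "key<TAB>value" line back into a pair.
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


lines = [to_line(k, v) for k, v in pairs]

# Sorting whole lines as text (a stand-in for the OS 'sort'): records with
# equal keys are then ordered by the text of their values as a side effect.
whole_line_sorted = [from_line(line) for line in sorted(lines)]

# Sorting on the key alone with a stable sort keeps equal-key records in
# their original emission order.
key_only_sorted = sorted(pairs, key=lambda kv: kv[0])

print(whole_line_sorted)  # [(1, [23, 24]), (1, [24, 23]), (2, [25, 26])]
print(key_only_sorted)    # [(1, [24, 23]), (1, [23, 24]), (2, [25, 26])]
```

The first result matches the output reported in the question, and the second matches the output the asker expected. Note that the text comparison in this sketch is lexicographic, so a key such as 10 would sort before 2; the sketch is meant only to show why values end up ordered, not to model every detail of the local runner.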
