How does MapReduce sort and shuffle work?
Question

I am using Yelp's mrjob library to implement map-reduce functionality. I know that MapReduce has an internal sort and shuffle step that sorts the mapper output by key. So if I have the following results after the map phase

(1, 24) (4, 25) (3, 26)

I know the sort and shuffle phase will produce the following output

(1, 24) (3, 26) (4, 25)

which is as expected.

But if I have two records with the same key and different values, why does the sort and shuffle phase also order them by the first value that appears in each?

For example, if I have the following list of values from the mapper

(2, <25, 26>) (1, <24, 23>) (1, <23, 24>)

the expected output is

(1, <24, 23>) (1, <23, 24>) (2, <25, 26>)

but the output that I am getting is

(1, <23, 24>) (1, <24, 23>) (2, <25, 26>)

Is this specific to the mrjob library? Is there any way to stop this sorting on the basis of values? The code for the job is listed after the answer.

Answer

The local mrjob runner just uses the operating system's 'sort' on the mapper output. The mapper writes each record as one line: the key, a tab character, the value, and a newline (the format is reproduced at the end of this post). Because whole lines are sorted, the records end up sorted primarily by key, but secondarily by value. As noted, this doesn't happen in the real Hadoop version, just in the 'local' simulation.

Code from the question:
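from mrjob.job import MRJob


class SortMR(MRJob):

    def steps(self):
        # self.mr() is the step constructor from the older mrjob API used in
        # the question; later releases use MRStep from mrjob.step instead.
        return [
            self.mr(mapper=self.rangemr,
                    reducer=self.rangesort)]

    def rangemr(self, key, line):
        # Emit every whitespace-separated token under the same key, 1.
        for a in line.split():
            yield 1, a

    def rangesort(self, numid, line):
        # 'line' is the iterator over all values grouped under key 'numid'.
        for a in line:
            yield 1, a


if __name__ == '__main__':
    SortMR.run()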
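Line format written by the mapper, as referenced in the answer (key, tab character, value, newline):

key<-tab->value\n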
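
The effect described in the answer can be reproduced without mrjob or Hadoop. The following is only a rough sketch, not mrjob's actual implementation: it assumes each record is written as a JSON-encoded key, a tab, and a JSON-encoded value, and it uses Python's `sorted` on the resulting strings as a stand-in for the operating system's `sort` (whose exact ordering also depends on the locale). The helper names `encode` and `decode` are made up for this illustration.

```python
import json


def encode(key, value):
    # Hypothetical helper: one record as "key<TAB>value", both JSON-encoded.
    return json.dumps(key) + "\t" + json.dumps(value)


def decode(line):
    # Hypothetical helper: inverse of encode().
    key, value = line.split("\t", 1)
    return json.loads(key), json.loads(value)


# Mapper output from the question, in emission order.
mapper_output = [(2, [25, 26]), (1, [24, 23]), (1, [23, 24])]
lines = [encode(k, v) for k, v in mapper_output]

# Sorting whole lines compares the value text whenever the keys are equal,
# so records with the same key come out ordered by value as a side effect.
whole_line_sorted = [decode(line) for line in sorted(lines)]

# Sorting on the key field only (Python's sort is stable) leaves records
# with equal keys in the order the mapper emitted them. Keys are compared
# as text here, which is good enough for the single-digit keys above.
key_only_sorted = [
    decode(line)
    for line in sorted(lines, key=lambda line: line.split("\t", 1)[0])
]

print(whole_line_sorted)
print(key_only_sorted)
```

The first list matches the output the asker observed with the local runner, and the second matches the output the asker expected. Since the order of values within a key comes from how the lines happen to be sorted, a reducer that needs a particular value order is safer sorting the values itself than relying on either behaviour.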