How does mapreduce sort and shuffle work?

Views: 148 Published: 2018/5/31 19:34:09 Tags: hadoop mapreduce mrjob


Problem description

I am using Yelp's MRJob library to implement map-reduce functionality. I know that map-reduce has an internal sort and shuffle phase that sorts the values by their keys. So if I have the following results after the map phase:
(1, 24) (4, 25) (3, 26)
I know the sort and shuffle phase will produce the following output:
(1, 24) (3, 26) (4, 25)
which is as expected.

But if I have two records with the same key and different values, why does the sort and shuffle phase also sort the data by value?

For example, if I have the following list of values from the mapper:
(2, <25, 26>) (1, <24, 23>) (1, <23, 24>)
the expected output is:
(1, <24, 23>) (1, <23, 24>) (2, <25, 26>)
but the output that I am getting is:
(1, <23, 24>) (1, <24, 23>) (2, <25, 26>)
Is this specific to the MRJob library? Is there any way to stop this sorting by value?

CODE
from mrjob.job import MRJob
import math

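class SortMR(MRJob):

    def steps(self):
        # A single step: one mapper followed by one reducer.
        return [
            self.mr(mapper=self.rangemr,
                    reducer=self.rangesort)]

    def rangemr(self, key, line):
        # Emit every whitespace-separated token under the same key, 1.
        for a in line.split():
            yield 1, a

    def rangesort(self, numid, line):
        # 'line' iterates over all values grouped under the key 'numid'.
        for a in line:
            yield (1, a)
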

if __name__ == '__main__':
    SortMR.run()

Solution
In local mode, MRJob just runs the operating system's 'sort' on the mapper output.

The mapper writes each record out in the format:

 key<-tab->value\n

Because whole lines are sorted, you end up with the records sorted primarily by key, but secondarily by value.

As noted, this doesn't happen in the real Hadoop version, just in the 'local' simulation.
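
The effect described above can be reproduced without MRJob or Hadoop. The following is only a rough sketch in plain Python: it assumes the mapper output is a list of "key<TAB>value" text lines, and it uses Python's built-in sorted() as a stand-in for the operating system's sort (which may collate differently depending on locale). The sample lines and the helper names sort_whole_lines and sort_by_key_only are made up for illustration and are not part of MRJob.

```python
# Hypothetical mapper output, one "key<TAB>value" line per record,
# in the order the mapper emitted them.
mapper_lines = [
    "2\t[25, 26]",
    "1\t[24, 23]",
    "1\t[23, 24]",
]


def sort_whole_lines(lines):
    """Sort complete lines as text, so the value acts as a tie-breaker."""
    return sorted(lines)


def sort_by_key_only(lines):
    """Sort on the text before the first tab only.

    sorted() is stable, so records with equal keys keep their input order.
    """
    return sorted(lines, key=lambda line: line.split("\t", 1)[0])


whole = sort_whole_lines(mapper_lines)
by_key = sort_by_key_only(mapper_lines)

print("whole-line sort:")
for line in whole:
    print("  " + line)

print("key-only stable sort:")
for line in by_key:
    print("  " + line)
```

With this sample input, the whole-line sort puts [23, 24] before [24, 23] under key 1, which is the order observed in the question, while the key-only stable sort keeps [24, 23] first, which is the order the question expected.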
