Shuffle and sort for mapreduce
Problem description
I read through the Definitive Guide and some other links on the web, including the one here. My question is: where exactly do shuffling and sorting happen? As per my understanding, they happen on both mappers and reducers. But some links mention that shuffling happens on mappers and sorting on reducers.

Can someone confirm whether my understanding is correct? If not, can you point me to additional documentation I can go through?

Answer

Shuffle: MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers map outputs to the reducers as inputs is known as the shuffle.

Sort: Sorting happens at various stages of a MapReduce program, so it can occur in both the Map and Reduce phases.

Please have a look at this diagram:

[image]

Adding more description of the Map and Reduce phases in the above image.

The Map Side: When the map function starts producing output, it is not simply written to disk. Before the map output is written to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key.

The Reduce Side: When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This is done in rounds.

Source: Hadoop: The Definitive Guide.
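To make the division of work concrete, here is a minimal sketch in plain Python of the two sides described above: each map task partitions its output and sorts every partition by key, and each reducer only merges the already-sorted runs it receives. This is a toy simulation, not Hadoop code. The names `partition_for`, `map_side`, `merge_in_rounds`, `reduce_side` and `shuffle` are hypothetical, the partitioner is a simple stand-in, and the round logic is a simplification of the way Hadoop actually schedules its merge rounds.

```python
import heapq
from collections import defaultdict


def partition_for(key, num_reducers):
    # Hypothetical stand-in partitioner: deterministic hash of the key,
    # modulo the number of reducers.
    return sum(key.encode("utf-8")) % num_reducers


def map_side(records, num_reducers):
    """Partition one map task's output, then sort each partition by key."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_reducers)].append((key, value))
    # The sort happens here, on the map side, once per partition.
    return {p: sorted(kvs, key=lambda kv: kv[0]) for p, kvs in partitions.items()}


def merge_in_rounds(runs, factor=10):
    """Merge sorted runs, at most `factor` at a time, until few enough remain."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    runs = [list(run) for run in runs]
    while len(runs) > factor:
        # One round: every group of `factor` runs becomes a single sorted run.
        runs = [
            list(heapq.merge(*runs[i:i + factor], key=lambda kv: kv[0]))
            for i in range(0, len(runs), factor)
        ]
    # Final merge feeds the reducer directly; no re-sorting is needed.
    return heapq.merge(*runs, key=lambda kv: kv[0])


def reduce_side(sorted_runs, factor=10):
    """Merge the copied map outputs and group the values of each key."""
    grouped = []
    for key, value in merge_in_rounds(sorted_runs, factor):
        if grouped and grouped[-1][0] == key:
            grouped[-1][1].append(value)
        else:
            grouped.append((key, [value]))
    return grouped


def shuffle(map_outputs, num_reducers, factor=10):
    """Run the map side for every map task, then the reduce side per reducer."""
    spills = [map_side(records, num_reducers) for records in map_outputs]
    return {
        r: reduce_side([spill.get(r, []) for spill in spills], factor)
        for r in range(num_reducers)
    }
```

For example, `shuffle([[("b", 1), ("a", 2), ("c", 3)], [("a", 4), ("c", 5), ("b", 6)]], 2)` returns `{0: [("b", [1, 6])], 1: [("a", [2, 4]), ("c", [3, 5])]}`. Note that `sorted` is called only in `map_side`; the reduce side relies on `heapq.merge`, which is why the book calls that phase a merge rather than a sort.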