MapReduce 排序算法是如何工作的? [英] How does the MapReduce sort algorithm work?

查看:29
本文介绍了MapReduce 排序算法是如何工作的?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

用于展示 MapReduce 功能的主要示例之一是 Terasort 基准测试.我无法理解 MapReduce 环境中使用的排序算法的基础知识.

One of the main examples that is used in demonstrating the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment.

对我来说,排序只是确定一个元素相对于所有其他元素的相对位置.所以排序涉及将一切"与一切"进行比较.您的平均排序算法(快速、冒泡、...)只是以一种聪明的方式做到了这一点.

To me sorting simply involves determining the relative position of an element in relationship to all other elements. So sorting involves comparing "everything" with "everything". Your average sorting algorithm (quick, bubble, ...) simply does this in a smart way.

在我看来,将数据集分成许多部分意味着您可以对单个部分进行排序,然后您仍然必须将这些部分整合到完整"的完全排序的数据集中.鉴于分布在数千个系统上的 TB 级数据集,我预计这是一项艰巨的任务.

In my mind splitting the dataset into many pieces means you can sort a single piece and then you still have to integrate these pieces into the 'complete' fully sorted dataset. Given the terabyte dataset distributed over thousands of systems I expect this to be a huge task.

那么这到底是怎么做的呢?这个 MapReduce 排序算法是如何工作的?

So how is this really done? How does this MapReduce sorting algorithm work?

谢谢你帮助我理解.

推荐答案

这里有一些关于 Hadoop 对 Terasort 的实施的详细信息:

TeraSort 是标准的 map/reduce 排序,除了自定义分区器使用 N-1 个采样键的排序列表定义每个 reduce 的键范围.特别地,所有使得 sample[i − 1] <= key <;发送样本 [i] 以减少 i.这保证了reduce i的输出都小于reduce i+1的输出."

TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."

所以他们的技巧在于他们在映射阶段确定键的方式.本质上,它们确保单个减速器中的每个值都保证针对所有其他减速器进行预排序".

So their trick is in the way they determine the keys during the map phase. Essentially they ensure that every value in a single reducer is guaranteed to be 'pre-sorted' against all other reducers.

我通过 James Hamilton 的博客文章找到了论文参考.

I found the paper reference through James Hamilton's Blog Post.

这篇关于MapReduce 排序算法是如何工作的?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆