Sorting large data using MapReduce/Hadoop

Question

I am reading about MapReduce, and the following is confusing me.

Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understand it, the approach is as follows:

Write a mapper function that sorts integers. The framework divides the input file into multiple chunks and gives them to different mappers. Each mapper sorts its chunk of data independently of the others. Once all the mappers are done, we pass each of their results to the reducer, which combines them and gives the final output.

My doubt is: if we have only one reducer, how does this leverage the distributed framework, given that we eventually have to combine the results in one place? The problem boils down to merging 1 million entries in one place. Is that so, or am I missing something?

Thanks, Chander

Recommended answer

Check out merge sort.

It turns out that sorting partially sorted data (here, merging lists that are each already sorted) is much more efficient, in terms of operations and memory consumption, than sorting the complete list from scratch.

If the reducer gets 4 sorted lists, it only needs to look at the smallest remaining element of each of the 4 lists and pick the smallest of those. If the number of lists is constant, this reduce step is an O(N) operation.
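
The merge step described above can be sketched in Python. This is only an illustration under stated assumptions, not how Hadoop implements its reducers: the function name `merge_sorted_lists` and the sample data are hypothetical, and each inner list stands in for the sorted output of one mapper. A heap keeps track of the smallest remaining element of each list, so each pick costs O(log k) for k lists; with k fixed (for example 4), the whole merge is O(N).

```python
import heapq


def merge_sorted_lists(sorted_lists):
    """Merge already-sorted lists into one sorted list (k-way merge)."""
    # The heap holds one (value, list index, position) entry per non-empty list,
    # i.e. the smallest not-yet-consumed element of each list.
    heap = [(lst[0], i, 0) for i, lst in enumerate(sorted_lists) if lst]
    heapq.heapify(heap)

    merged = []
    while heap:
        # Pick the smallest of the current list heads.
        value, i, pos = heapq.heappop(heap)
        merged.append(value)
        # Advance within the list the value came from, if anything is left.
        if pos + 1 < len(sorted_lists[i]):
            heapq.heappush(heap, (sorted_lists[i][pos + 1], i, pos + 1))
    return merged


# Hypothetical sorted outputs of 4 mappers.
mapper_outputs = [[1, 5, 9], [2, 6], [0, 7, 8], [3, 4]]
print(merge_sorted_lists(mapper_outputs))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The reducer never re-sorts anything: it only compares the current heads of the lists, which is why merging is cheaper than sorting all N entries again.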

Also, the reducers are typically "distributed" in something like a tree, so the work can be parallelized too.
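
As a rough sketch of the tree idea, the following Python code merges sorted lists pairwise, level by level, until one list remains. The function name `tree_merge` is hypothetical, and the code runs sequentially on one machine; it only shows the shape of the computation. The merges within one level do not depend on each other, which is what would allow them to run in parallel on different workers.

```python
from heapq import merge


def tree_merge(sorted_lists):
    """Merge sorted lists pairwise, level by level, like a reduction tree."""
    level = [list(lst) for lst in sorted_lists]
    if not level:
        return []
    while len(level) > 1:
        next_level = []
        # Each pair below is independent of the others on the same level,
        # so in a distributed setting these merges could run concurrently.
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]  # the last "pair" may hold a single list
            next_level.append(list(merge(*pair)))
        level = next_level
    return level[0]


print(tree_merge([[1, 5, 9], [2, 6], [0, 7, 8], [3, 4]]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With k input lists the tree has about log2(k) levels, and only the final merge at the root has to touch all N entries in one place.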
