Sorting large data using MapReduce/Hadoop

Question




I am reading about MapReduce and the following thing is confusing me.

Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows:

Write a mapper function that sorts integers. The framework will divide the input file into multiple chunks and give them to different mappers. Each mapper will sort its chunk of data independently of the others. Once all the mappers are done, we pass each of their results to the reducer, which combines them and gives me the final output.

My doubt is: if we have one reducer, how does it leverage the distributed framework if we eventually have to combine the results in one place? The problem boils down to merging 1 million entries in one place. Is that so, or am I missing something?

Thanks, Chander

Answer

Check out merge-sort.

It turns out that merging lists that are already sorted is much more efficient, in terms of operations and memory consumption, than sorting the complete list from scratch.

If the reducer gets 4 sorted lists, it only needs to look at the smallest remaining element of each of the 4 lists and pick the smallest of those. If the number of lists is constant, this reduce step is an O(N) operation.
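To make the merge step concrete, here is a minimal sketch in plain Python. It is not Hadoop code; the function name `merge_sorted_lists` and the sample data are made up for illustration. It assumes each input list is already sorted, as the mapper outputs would be, and repeatedly picks the smallest head element among the lists using a small heap.

```python
import heapq


def merge_sorted_lists(sorted_lists):
    """Merge already-sorted lists into one sorted list (hypothetical helper)."""
    # The heap holds one entry per non-empty list:
    # (current value, index of the list, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(sorted_lists) if lst]
    heapq.heapify(heap)

    merged = []
    while heap:
        # Pop the smallest head element among all lists.
        value, list_index, pos = heapq.heappop(heap)
        merged.append(value)
        # Push the next element from the same list, if there is one.
        next_pos = pos + 1
        if next_pos < len(sorted_lists[list_index]):
            next_value = sorted_lists[list_index][next_pos]
            heapq.heappush(heap, (next_value, list_index, next_pos))
    return merged


if __name__ == "__main__":
    # Four "mapper outputs", each sorted on its own.
    chunks = [[1, 5, 9], [2, 6], [0, 7, 8], [3, 4]]
    print(merge_sorted_lists(chunks))
```

With k lists and N elements in total, each element costs one pop and at most one push on a heap of size at most k, so the work is O(N log k). For a fixed k such as 4, that is the O(N) behavior described above. Only k elements need to be held in the heap at a time, which is where the memory saving comes from.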

Also, the reducers are typically "distributed" too, in something like a tree, so the work can be parallelized as well.
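The tree idea can be sketched the same way. This is only an illustration of the shape of the computation in plain Python, not how Hadoop schedules reducers; `tree_merge` is a hypothetical name. Sorted lists are merged in pairs, level by level, and the pairwise merges within one level are independent of each other, so they could in principle run on different machines.

```python
from heapq import merge


def tree_merge(sorted_lists):
    """Merge sorted lists pairwise, level by level (hypothetical helper)."""
    level = [list(lst) for lst in sorted_lists]
    if not level:
        return []
    while len(level) > 1:
        next_level = []
        # Each pair below is independent of the others in this level,
        # so these merges are the part that could run in parallel.
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]  # the last "pair" may hold a single list
            next_level.append(list(merge(*pair)))
        level = next_level
    return level[0]


if __name__ == "__main__":
    chunks = [[1, 4], [2, 3], [0, 9], [5], [6, 7, 8]]
    print(tree_merge(chunks))
```

The final merge at the root of the tree still touches every element once, so a single machine does O(N) work at the end, but all of the earlier levels are shared out.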
