MapReduce job output sort order


Problem description

I can see in my MapReduce jobs that the output of the reducer part is sorted by key.

So if I set the number of reducers to 10, the output directory contains 10 files, and each of those output files holds sorted data.

The reason I am asking is that even though every file contains sorted data, the files themselves are not sorted relative to one another. For example, there are cases where a part-000* file starts at 0 and ends at zzzz (I am using Text as the key).

I was assuming the data would also be sorted across files, i.e. the first file would hold the keys starting with a, and the last file, part-00009, would hold the entries around zzzz, or at least entries > a, assuming the keys are uniformly distributed over the alphabet.

Could someone shed some light on why this behaviour occurs?
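
For context on why this happens: each reducer sorts only the keys it receives, and the default partitioner hands keys to reducers by hash rather than by key range. The following sketch mirrors the logic of Hadoop's built-in HashPartitioner (shown here purely as an illustration, not the original poster's code):

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the default partitioning logic: the target reducer is derived from
// the key's hash, not from its position in the sort order, so any single
// reducer (and hence any single part file) can receive keys from anywhere
// in the key range.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // With 10 reducers, "aaa" and "zzzz" can easily hash into the same partition.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```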

Solution

You can achieve a globally sorted output (which is basically what you want) using one of these methods:

  1. Use just one reducer (a bad idea: it puts all the work on a single machine).

  2. Write a custom partitioner. The partitioner is the class that divides the key space among the reducers. The default partitioner (HashPartitioner) spreads the key space evenly across the reducers by hash, not by key range; see the sketch after this list for a minimal example of a range-based partitioner.

  3. Use Hadoop Pig/Hive to do the sort.
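
For option 2, here is a minimal sketch of such a range-based partitioner, assuming Text keys that begin with a lowercase letter and are roughly uniformly distributed over the alphabet (as in the question); the class name AlphabetRangePartitioner and the IntWritable value type are illustrative, not from the original post:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys to reducers by their first letter, so part-r-00000 receives the
// lexicographically smallest keys and the last part file the largest.
public class AlphabetRangePartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (key.getLength() == 0) {
      return 0;
    }
    char first = key.toString().charAt(0);
    if (first < 'a') {
      return 0;                      // anything before 'a' goes to the first reducer
    }
    if (first > 'z') {
      return numReduceTasks - 1;     // anything after 'z' goes to the last reducer
    }
    // Map 'a'..'z' onto contiguous buckets 0..numReduceTasks-1 in order.
    return (first - 'a') * numReduceTasks / 26;
  }
}
```

Registering it with `job.setPartitionerClass(AlphabetRangePartitioner.class)` makes the part files range-ordered, so concatenating them in order (part-r-00000, part-r-00001, ...) yields a globally sorted result. For keys that are not uniformly distributed, Hadoop also provides TotalOrderPartitioner, which reads split points produced by InputSampler so that each reducer receives a contiguous and roughly equal-sized key range.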
