MapReduce job output sort order


Problem description

In my MapReduce jobs, I can see that the output of the reduce phase is sorted by key.

So if I set the number of reducers to 10, the output directory contains 10 files, and the data within each of those files is sorted.

The reason I am asking is that even though the data within every file is sorted, the files themselves are not ordered relative to one another. For example, assuming I use Text as the key, there are cases where a single part-000* file starts at 0 and ends at zzzz.

I was assuming that the ordering should also hold across files, i.e. the first file should contain the entries starting with a, and the last file, part-00009, should contain the entries with zzzz, or at least entries > a.

This assumes my keys are uniformly distributed across all letters of the alphabet.

Could someone shed some light on why it behaves this way?

Recommended answer

You can produce globally sorted output (which is basically what you want) using one of these methods:

  1. Use just one reducer in the MapReduce job (bad idea! This puts too much work on one machine).
  2. Write a custom partitioner. The Partitioner is the class that divides the key space in MapReduce. The default partitioner (HashPartitioner) assigns each key to a reducer by its hash, so the keys are spread roughly evenly over the reducers, but every reducer receives keys from the whole range. That is why each part file is sorted internally while the files are not ordered relative to one another. Check out this example for writing a custom partitioner.
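To make the difference concrete, here is a rough simulation in plain Python rather than actual Hadoop code. It routes keys to reducers with a hash-based rule and with a range-based rule, sorts within each reducer, and checks whether the part files concatenated in order are globally sorted. Everything here is an assumption for illustration: the function names are hypothetical, `zlib.crc32` merely stands in for Java's `hashCode()`, and the range rule splits on the first character assuming mostly lowercase a-z keys. A real job would implement the same idea in a Java `Partitioner` subclass.

```python
import zlib


def hash_partition(key: str, num_reducers: int) -> int:
    """Hash-style routing: partition = hash(key) mod number of reducers.

    zlib.crc32 is only a deterministic stand-in for Java's hashCode().
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def range_partition(key: str, num_reducers: int) -> int:
    """Hypothetical range routing on the first character.

    The mapping is monotone in the first character, so every key sent to
    reducer i sorts at or before every key sent to reducer i + 1.
    """
    if not key or key[0] < "a":
        return 0  # empty keys, digits, uppercase: all sort before 'a'
    if key[0] > "z":
        return num_reducers - 1
    return (ord(key[0]) - ord("a")) * num_reducers // 26


def run_job(keys, partitioner, num_reducers):
    """Simulate the shuffle: route each key, then sort inside each reducer."""
    parts = [[] for _ in range(num_reducers)]
    for key in keys:
        parts[partitioner(key, num_reducers)].append(key)
    return [sorted(part) for part in parts]


def globally_sorted(parts) -> bool:
    """True if reading the part files in order yields one sorted sequence."""
    flat = [key for part in parts for key in part]
    return flat == sorted(flat)


if __name__ == "__main__":
    sample = [c + s for c in "abcdefghijklmnopqrstuvwxyz" for s in ("", "pple", "zzzz")]
    for name, fn in (("hash", hash_partition), ("range", range_partition)):
        parts = run_job(sample, fn, 10)
        print(name, "globally sorted:", globally_sorted(parts))
        print("  first part file:", parts[0])
```

With the hash rule, each simulated part file is sorted but covers keys from across the alphabet, which matches the behavior described in the question. With the range rule, each reducer owns a contiguous slice of the key space, so the files line up in order. The fixed first-letter split only balances the load if the keys really are uniformly distributed, as the question assumes.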

Alternatively, use Pig or Hive on Hadoop to do the sort.
