Is the output of the map phase of a MapReduce job always sorted?


Problem description



I am a bit confused by the output I get from the Mapper.

For example, when I run a simple wordcount program, with this input text:

hello world
Hadoop programming
mapreduce wordcount
lets see if this works
12345678
hello world
mapreduce wordcount

This is the output that I get:

12345678    1
Hadoop  1
hello   1
hello   1
if  1
lets    1
mapreduce   1
mapreduce   1
programming 1
see 1
this    1
wordcount   1
wordcount   1
works   1
world   1
world   1

As you can see, the output from the mapper is already sorted, even though I did not run a Reducer at all. But in a different project I find that the output from the mapper is not sorted. So I am not at all clear about this.

My questions are:

  1. Is the mapper's output always sorted?
  2. Is the sort phase integrated into the mapper phase already, so that the output of map phase is already sorted in the intermediate data?
  3. Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and an iterable of values. Is there a way I could persist this data?

Solution

Is the mapper's output always sorted?

No. It is not sorted if the job runs with no reduce tasks. If the job has a reduce phase, the map output is partitioned and sorted by key on the map side before it is written to local disk, and the sorted pieces are then merged on the reduce side. What is probably happening here (just a guess) is that you are not specifying a Reducer class and have not set the number of reduce tasks to zero. In that case the job still runs a reduce phase with the default Reducer, which in the new API simply passes its input through, i.e. it behaves like an identity reducer (see this answer and comment). To verify that, check the job's reduce counters (there should be some reduce tasks, reduce input records and groups, reduce output records...).
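To make the difference concrete, here is a minimal plain-Java sketch of the two cases. It does not use any Hadoop classes: `MapOutputOrderDemo`, `map`, and `sortThenIdentityReduce` are hypothetical names, and a `TreeMap` merely stands in for the framework's sort and grouping step. Java's natural `String` ordering is used as an approximation of how text keys are compared, which matches for the ASCII words in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation (no Hadoop classes) of map-only output vs. output that passed through sort + identity reduce. */
class MapOutputOrderDemo {

    /** Word-count style map step: emits (word, "1") pairs in the order the words appear. */
    static List<String[]> map(List<String> lines) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new String[] {word, "1"});
                }
            }
        }
        return pairs;
    }

    /** Stand-in for sort/shuffle followed by an identity reducer: group by key in sorted order, re-emit every value. */
    static List<String[]> sortThenIdentityReduce(List<String[]> mapOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : mapOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            for (String value : entry.getValue()) {
                out.add(new String[] {entry.getKey(), value});
            }
        }
        return out;
    }

    /** Helper: extracts just the keys, in emission order. */
    static List<String> keys(List<String[]> pairs) {
        List<String> result = new ArrayList<>();
        for (String[] pair : pairs) {
            result.add(pair[0]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "hello world", "Hadoop programming", "mapreduce wordcount",
                "lets see if this works", "12345678", "hello world", "mapreduce wordcount");
        List<String[]> mapped = map(input);
        System.out.println("map-only order:        " + keys(mapped));
        System.out.println("after identity reduce: " + keys(sortThenIdentityReduce(mapped)));
    }
}
```

In this sketch the map-only list keeps the input order, while the second list comes out ordered by key with duplicates preserved, which is the shape of the output shown in the question.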

Is the sort phase integrated into the mapper phase already, so that the output of map phase is already sorted in the intermediate data?

As I explained in the previous answer, if you use no reducers, the mapper does not sort the data. If you do use reducers, the data starts getting sorted in the map phase and then gets merge-sorted in the reduce phase.
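The "sorted on the map side, merge-sorted on the reduce side" idea can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions, not Hadoop's actual implementation: `SpillMergeDemo`, `sortSpill`, and `mergeRuns` are hypothetical names, each "spill" is just an in-memory list of keys, and partitioning is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java sketch: sort each run independently, then k-way merge the sorted runs. */
class SpillMergeDemo {

    /** Map side (simulated): sort one buffer's worth of keys before it is "spilled". */
    static List<String> sortSpill(List<String> bufferedKeys) {
        List<String> run = new ArrayList<>(bufferedKeys);
        Collections.sort(run);
        return run;
    }

    /** Reduce side (simulated): k-way merge of already-sorted runs into one sorted stream. */
    static List<String> mergeRuns(List<List<String>> runs) {
        // Each queue entry is {run index, position in that run}, ordered by the key it points at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heads.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<String> run = runs.get(head[0]);
            merged.add(run.get(head[1]));
            // Advance within the same run, if it has more keys.
            if (head[1] + 1 < run.size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> runA = sortSpill(Arrays.asList("hello", "world", "Hadoop"));
        List<String> runB = sortSpill(Arrays.asList("mapreduce", "12345678", "hello"));
        System.out.println(mergeRuns(Arrays.asList(runA, runB)));
    }
}
```

The merge never re-sorts anything; it only repeatedly picks the smallest head among runs that are already sorted, which is why the sorting work can be split between the two phases.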

Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and an iterable of values. Is there a way I could persist this data?

Again, shuffling and sorting are parts of the Reduce phase. An identity reducer will do what you want. If you want to output one key-value pair per key (that is, per reduce call), with the value being a concatenation of the values in the iterable, just accumulate them in memory (e.g. in a StringBuffer) and then output this concatenation as the value. If you want the map output to go straight to the program's output, without going through a reduce phase, then set the number of reduce tasks to zero in the driver class, like this:

job.setNumReduceTasks(0);

This will not get your output sorted, though. It will skip the pre-sorting process of the mapper and write the output directly to HDFS.
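For the concatenation approach mentioned earlier, the core of such a reduce step might look like the sketch below. It is plain Java with no Hadoop types: `ConcatValuesDemo` and `concatValues` are hypothetical names, the separator is an assumption, and `StringBuilder` is used in place of `StringBuffer` because no synchronization is needed here.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of joining all values seen for one key into a single output value. */
class ConcatValuesDemo {

    /** Concatenates the values in iteration order, inserting the separator between them. */
    static String concatValues(Iterable<String> values, String separator) {
        StringBuilder joined = new StringBuilder();
        boolean first = true;
        for (String value : values) {
            if (!first) {
                joined.append(separator);
            }
            joined.append(value);
            first = false;
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Simulated reduce call for the key "hello", which occurred twice in the input.
        List<String> valuesForHello = Arrays.asList("1", "1");
        System.out.println("hello\t" + concatValues(valuesForHello, ","));
    }
}
```

In a real reducer the values would be Hadoop writables rather than strings. Since the framework may reuse the value object between iterations, converting each value to a string as it is read (as the sketch effectively does) is the safer pattern. Keep in mind that all values for one key are held in memory at once.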
