How does MapReduce parallel processing really work in Hadoop with respect to the word count example?


Problem Description

I am learning Hadoop MapReduce using the word count example; please see the attached diagram:

My questions are about how the parallel processing actually happens. My understanding and questions are below; please correct me if I am wrong:

  1. Split step: This assigns the number of mappers. Here the two data sets go to two different processors [p1,p2], so two mappers? Is this splitting done by a first processor P?
  2. Mapping step: Each of these processors [p1,p2] now divides the data into key-value pairs by applying the required function f() to keys, which produces values v, giving [k1,v1],[k2,v2].
  3. Merge step 1: Within each processor, values are grouped by key, giving [k1,[v1,v2,v3]].
  4. Merge step 2: Now p1 and p2 return their output to P, which merges both sets of resulting key-value pairs. This happens in P.
  5. Sorting step: Now P will sort all the results.
  6. Reduce step: Here P will apply f() to each individual key [k1,[v1,v2,v3]] to give [k1,V].

Let me know if this understanding is right; I have a feeling I am completely off in many respects.

Solution

Let me explain each step in a little more detail so that it is clearer to you. I have tried to keep the explanations as brief as possible, but I would recommend that you go through the official docs (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html) to get a good feel for the whole process.

  1. Split Step: If you have written a few programs by now, you must have observed that we sometimes set the number of reducers, but we never set the number of mappers, because the number of mappers depends on the number of input splits. In simple words, the number of mappers in any job is proportional to the number of input splits. So the question becomes how the splitting takes place. That actually depends on a number of factors, such as mapred.max.split.size, which sets the maximum size of an input split; there are many other knobs as well, but the point is that we can control the size of an input split (see the configuration sketch after this list).

  2. Mapping Step: If by 2 processors you mean 2 JVMs (2 containers), 2 different nodes, or 2 mappers, then your intuition is wrong. Containers, or for that matter nodes, have nothing to do with splitting any input file; it is the job of HDFS to divide and distribute the files across different nodes, and it is then the responsibility of the resource manager to initiate the mapper task, if possible, on the same node that holds the input split. Once a map task is initiated, you can create key-value pairs according to the logic in your mapper (see the mapper sketch after this list). One thing to remember here is that one mapper can only work on one input split.
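
As a minimal sketch of the first point: capping the split size is one way to influence how many mappers a job gets. The class name and the 64 MB figure below are illustrative choices, not anything prescribed above; the setMaxInputSplitSize helper is the newer-API counterpart of the mapred.max.split.size property.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Equivalent low-level property (the old name mentioned above):
            // conf.setLong("mapred.max.split.size", 64L * 1024 * 1024);
            Job job = Job.getInstance(conf, "word count");
            // Cap each input split at 64 MB (an illustrative value): a large
            // input file then yields more splits, and therefore more mappers.
            FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        }
    }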
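
And for the second point, a word-count mapper in the org.apache.hadoop.mapreduce API, close to the one in the official tutorial linked above. Each call to map() handles one record from the mapper's single input split:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in this line of the split.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }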

You have mixed things up a little in step 3, step 4, and step 5. I have tried to explain those steps with reference to the actual classes that handle them.

  1. Partitioner class: This class divides the output from the mapper task according to the number of reducers. It is useful if you have more than 1 reducer; otherwise it does not affect your output. It contains a method called getPartition, which decides which reducer a given mapper output record will go to (if you have more than one reducer); this method is called for each key present in the mapper output. You can override this class, and subsequently this method, to customize it according to your requirements (a partitioner sketch follows the sorting example below). So in the case of your example, since there is one reducer, it will merge the output from both mappers into a single file. If there were more reducers, the same number of intermediate files would be created.

  2. WritableComparator class: The sorting of your map output is done by this class, on the basis of the key. Like the Partitioner class, you can override it (a comparator sketch also follows below). In your example, if the key is a colour name, the keys will be sorted like this (here we are assuming you do not override this class, so the default sort order for Text is used, which is alphabetical):


    Black,1
    Black,1
    Black,1
    Blue,1
    Blue,1
    .
    .
    and so on 
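
To make the Partitioner point concrete, here is a minimal sketch of overriding getPartition. The routing rule (colour names starting with "B" go to reducer 0) is an invented example, not something the framework prescribes; you would register the class with job.setPartitionerClass(ColourPartitioner.class):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class ColourPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // With a single reducer this can only return 0, which is why a
            // Partitioner has no visible effect in that case.
            if (numPartitions == 1) {
                return 0;
            }
            // Invented rule for illustration: "B..." keys go to reducer 0.
            if (key.toString().startsWith("B")) {
                return 0;
            }
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }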
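
And if you did want to override the default alphabetical order, a sketch of a custom WritableComparator (descending order is an arbitrary illustration) could look like this; it would be registered with job.setSortComparatorClass(DescendingTextComparator.class):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class DescendingTextComparator extends WritableComparator {
        protected DescendingTextComparator() {
            // true: let the parent create Text instances for deserialization
            super(Text.class, true);
        }

        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            // Negate the default Text comparison to sort in descending order.
            return -((Text) a).compareTo((Text) b);
        }
    }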

Now this same class is also used for grouping your values according to your key, so that in the reducer you can iterate over them (the wiring sketch after this example shows where grouping is configured). In the case of your example:

    Black -> {1,1,1}
    Blue -> {1,1,1,1,1,1}
    Green -> {1,1,1,1,1}
    .
    .
    and so on
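
A short sketch of where this grouping behaviour is wired up in the driver. By default the same comparator governs both sorting and grouping; setGroupingComparatorClass is only needed when the two should differ. The class names are the illustrative ones used above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ComparatorWiringDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            // Sorting of mapper output keys:
            job.setSortComparatorClass(DescendingTextComparator.class);
            // Grouping of values into one reduce() call per "equal" key:
            job.setGroupingComparatorClass(DescendingTextComparator.class);
        }
    }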

  3. Reducer: This step simply reduces your map output according to the logic defined in your Reducer class (see the sketch below). Your intuition is appropriate for this class.
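
For completeness, the matching word-count reducer, essentially as in the official tutorial: reduce() is called once per key with all of its grouped values (e.g. Black -> {1,1,1}) and emits (key, sum). In the driver you would hook it up with job.setReducerClass(IntSumReducer.class):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get(); // add up the 1s emitted by the mappers
            }
            result.set(sum);
            context.write(key, result);
        }
    }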

There are also some other mechanisms that affect the intermediate steps between the mapper and the reducer, and before the mapper as well, but those are not that relevant to what you want to know.

I hope this resolves your query.
