How does MapReduce parallel processing really work in Hadoop, with respect to the word count example?


Question

I am learning Hadoop MapReduce using the word count example; please see the attached diagram:

My questions are about how the parallel processing actually happens. My understanding is below; please correct me if I am wrong:

  1. Split step: this determines the number of mappers. Here the two data sets go to two different processors [p1, p2], so two mappers? This splitting is done by the first processor P.
  2. Mapping step: each of these processors [p1, p2] now divides the data into key-value pairs by applying the required function f() to each key, which produces a value v, giving [k1,v1], [k2,v2].
  3. Merge step 1: within each processor, values are grouped by key, giving [k1,[v1,v2,v3]].
  4. Merge step 2: now p1 and p2 return their output to P, which merges both resultant sets of key-value pairs. This happens in P.
  5. Sorting step: P now sorts all the results.
  6. Reduce step: P applies f() to each individual key [k1,[v1,v2,v3]] to give [k1,V].
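The logical dataflow in the list above (split, map, group, sort, reduce) can be sketched in plain Python. This is only a single-process simulation of the dataflow, not how Hadoop actually schedules the work across processors; the sample words are made up for illustration:

```python
from collections import defaultdict

# Two "input splits" -- in Hadoop these come from HDFS blocks,
# not from a coordinating processor P as in the question's diagram.
splits = [["blue", "black", "blue"], ["green", "blue", "black"]]

def mapper(record):
    # Map step: emit a (key, value) pair per word.
    return (record, 1)

# Each split is processed by its own mapper (here, sequentially).
mapped = [[mapper(word) for word in split] for split in splits]

# Shuffle/merge: group all values by key across both mapper outputs.
grouped = defaultdict(list)
for split_output in mapped:
    for key, value in split_output:
        grouped[key].append(value)

def reducer(key, values):
    # Reduce step: sum the 1s for each word.
    return (key, sum(values))

# Sort step: keys are handled in sorted order, as Hadoop does.
result = dict(reducer(k, grouped[k]) for k in sorted(grouped))
print(result)  # {'black': 2, 'blue': 3, 'green': 1}
```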

Let me know whether this understanding is right; I have a feeling I am completely off in many respects.

Answer

Let me explain each step in a little more detail so that it is clearer to you. I have tried to keep the explanations as brief as possible, but I would recommend you go through the official docs (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html) to get a good feel for the whole process.

  1. Split step: if you have written some programs by now, you must have observed that we sometimes set the number of reducers, but we never set the number of mappers. The reason is that the number of mappers depends on the number of input splits; in simple words, the number of mappers in any job is proportional to the number of input splits. So now the question arises: how does the splitting take place? That actually depends on a number of factors, such as mapred.max.split.size, which sets the size of an input split. There are other ways too, but the point is that we can control the size of the input splits.
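The relationship between input size, split size, and mapper count can be sketched as a back-of-the-envelope calculation. This is an illustration only, not Hadoop's exact split planner, which also considers HDFS block boundaries and the minimum split size:

```python
import math

def num_mappers(input_size_bytes, max_split_size_bytes):
    # Roughly one mapper per input split; Hadoop's real planner also
    # respects block boundaries and mapred.min.split.size.
    return math.ceil(input_size_bytes / max_split_size_bytes)

one_gib = 1024 ** 3
# A 1 GiB input with a 128 MiB split size -> 8 splits -> 8 mappers.
print(num_mappers(one_gib, 128 * 1024 ** 2))  # 8
```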

Mapping step: if by 2 processors you mean 2 JVMs (2 containers), 2 different nodes, or 2 mappers, then your intuition is wrong. Containers, or for that matter nodes, have nothing to do with splitting any input file; it is the job of HDFS to divide and distribute the files across different nodes. The resource manager is then responsible for initiating the mapper task on the same node that holds the input split, if possible, and once a map task is initiated you can create key-value pairs according to your logic in the mapper. One thing to remember here is that one mapper can only work on one input split.
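The key-value logic of a word-count mapper can be sketched like this, with plain Python standing in for the Java Mapper. In Hadoop, the framework calls map() once per input record and collects the emitted pairs; with TextInputFormat a record is a (byte offset, line) pair:

```python
def map_wordcount(offset, line):
    # Word count ignores the byte offset and emits (word, 1) per token.
    return [(word, 1) for word in line.split()]

pairs = map_wordcount(0, "blue black blue")
print(pairs)  # [('blue', 1), ('black', 1), ('blue', 1)]
```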

You have mixed things up a little in steps 3, 4 and 5. I have tried to explain those steps with reference to the actual classes that handle them.

  1. Partitioner class: this class divides the output of the mapper tasks according to the number of reducers. It matters if you have more than 1 reducer; otherwise it does not affect your output. It contains a method called getPartition, which decides which reducer your mapper output will go to (if you have more than one reducer); this method is called for each key present in the mapper output. You can override this class, and subsequently this method, to customize it to your requirements. So in your example, since there is one reducer, it will merge the output from both mappers into a single file. If there had been more reducers, the same number of intermediate files would have been created.
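The default partitioning logic can be sketched in Python; Hadoop's HashPartitioner computes (key.hashCode() &amp; Integer.MAX_VALUE) % numReduceTasks in Java, which this imitates:

```python
def get_partition(key, num_reducers):
    # Mask off the sign bit, as Hadoop's HashPartitioner does, so the
    # modulo result is always a valid reducer index.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# With a single reducer, every key lands in partition 0, so both
# mappers' outputs are merged into one reducer input.
print({k: get_partition(k, 1) for k in ["Black", "Blue", "Green"]})
```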

WritableComparator class: sorting of your map output is done by this class, on the basis of keys. Like the Partitioner class, you can override it. In your example, if the key is a colour name, it will sort the keys like this (here we are assuming you do not override this class, so it uses the default sorting for Text, which is alphabetical order):


    Black,1
    Black,1
    Black,1
    Blue,1
    Blue,1
    .
    .
    and so on 

This same class is also used to group your values by key, so that in the reducer you can iterate over them; in your example:

Black -> {1,1,1}   
Blue -> {1,1,1,1,1,1}
Green -> {1,1,1,1,1}
.
.
and so on
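The sort-then-group behaviour above can be sketched with Python's sorted and itertools.groupby. This is a simulation of the shuffle phase's outcome, not Hadoop's on-disk merge sort:

```python
from itertools import groupby

# Merged mapper output, in arrival order (unsorted).
pairs = [("Blue", 1), ("Black", 1), ("Green", 1),
         ("Black", 1), ("Blue", 1), ("Blue", 1)]

# Sort by key (Hadoop sorts Text keys lexicographically by default),
# then group so each reduce call sees (key, iterable-of-values).
pairs.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in vs]
           for k, vs in groupby(pairs, key=lambda kv: kv[0])}
print(grouped)  # {'Black': [1, 1], 'Blue': [1, 1, 1], 'Green': [1]}
```

Note that groupby only merges adjacent equal keys, which is why the sort must come first; Hadoop's shuffle guarantees the same ordering before grouping.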

  1. Reducer: this step simply reduces your map output according to the logic defined in your reducer class. Your intuition for this step is appropriate.
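The word-count reduce step can be sketched like this, with plain Python standing in for the Java Reducer; Hadoop calls reduce() once per key, passing the grouped values:

```python
def reduce_wordcount(key, values):
    # For word count, the reduce function just sums the grouped 1s.
    return (key, sum(values))

print(reduce_wordcount("Blue", [1, 1, 1, 1, 1, 1]))  # ('Blue', 6)
```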

There are some other things that also affect the intermediate step between the mapper and the reducer, and the stage before the mapper too, but those are not that relevant to what you want to know.

I hope this resolves your query.
