How does partitioning in MapReduce exactly work?


Question



I think I have a fair understanding of the MapReduce programming model in general, but even after reading the original paper and some other sources many details are unclear to me, especially regarding the partitioning of the intermediate results.

I will quickly summarize my understanding of MapReduce so far: We have a potentially very large input data set, which is automatically split up into M different pieces by the MR framework. For each piece, the framework schedules one map task, which is executed by one of the available processors/machines in my cluster. Each of the M map tasks outputs a set of key-value pairs, which is stored locally on the same machine that executed the map task. Each machine divides its disk into R partitions and distributes its computed intermediate key-value pairs among the partitions based on the intermediate keys. Then, the framework starts one reduce task for each distinct intermediate key, which is again executed by any of the available machines.
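The split → map → partition → shuffle → reduce flow described above can be sketched as a toy in-memory simulation (a hypothetical word-count job with made-up input; a real framework like Hadoop does this on disk and across machines, with one reduce task per partition):

```python
# Toy simulation of the MapReduce data flow (hypothetical word-count job).
from collections import defaultdict

R = 3  # number of partitions / reduce tasks, chosen by the user

def map_task(chunk):
    # word-count style mapper: emit (word, 1) for each word
    return [(word, 1) for word in chunk.split()]

def partition(key, num_partitions):
    # mimics the idea of Hadoop's default HashPartitioner
    return hash(key) % num_partitions

# 1. The input is split into M pieces; each piece gets one map task.
splits = ["a b a", "b c", "a c c"]

# 2. Each map task writes its output into R local partitions.
map_outputs = []  # one dict of partition -> pairs per map task
for chunk in splits:
    local = defaultdict(list)
    for k, v in map_task(chunk):
        local[partition(k, R)].append((k, v))
    map_outputs.append(local)

# 3. Reduce task r fetches partition r from *every* map task (the shuffle),
#    sorts/groups by key, then reduces one key group at a time.
result = {}
for r in range(R):
    fetched = [pair for out in map_outputs for pair in out.get(r, [])]
    groups = defaultdict(list)
    for k, v in sorted(fetched):
        groups[k].append(v)
    for k, vs in groups.items():
        result[k] = sum(vs)  # word-count reducer

print(sorted(result.items()))  # [('a', 3), ('b', 2), ('c', 3)]
```

Note that a key is always routed to the same partition number on every machine, which is what makes the per-partition shuffle in step 3 sufficient.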

Now my questions are:

  1. In some tutorials it sounds like there could be map and reduce tasks executed in parallel. Is this right? How could that be, assuming that for each distinct intermediate key only one reduce task is started? Do we not have to wait until the last map task is finished before we can start the first reduce task?
  2. As we have one reduce task per distinct intermediate key, is it right that each reduce task requires the executing machine to load the corresponding partition from every other machine? Potentially, every machine can have a key-value-pair with the desired intermediate key, so for each reduce task we potentially have to query all other machines. Is that really efficient?
  3. The original paper says that the number of partitions (R) is specified by the user. But isn't a partition the input for a reduce task? Or more exactly: isn't the union of all partitions with the same number across all machines the input of one reduce task? That would mean that R depends on the number of distinct intermediate keys, which the user usually doesn't know.

Conceptually it is clear what the input and outputs of the map and reduce functions/tasks are. But I think I haven’t yet understood MapReduce on the technical level. Could somebody please help me understanding?

Solution

  1. You can start the reducer tasks while the map tasks are still running (using a feature known as slowstart), but the reducers can only run the copy phase (acquiring the completed results from finished map tasks). A reducer needs to wait for all the mappers to complete before it can actually perform the final sort and reduce.
  2. A reduce task actually processes zero, one or more keys (rather than a discrete task for each key). Each reducer will need to acquire the map output from each map task that relates to its partition before these intermediate outputs are sorted and then reduced one key set at a time.
  3. Back to the note in 2 - a reducer task (one for each partition) runs on zero, one or more keys rather than a single task for each discrete key.
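In Hadoop, the slowstart behavior from point 1 is tunable per job; if memory serves, the property is `mapreduce.job.reduce.slowstart.completedmaps`, the fraction of map tasks that must finish before reducers are scheduled (a small value by default, so the copy phase starts early). A hedged example for `mapred-site.xml`:

```xml
<!-- Start scheduling reducers only after 80% of map tasks have completed -->
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.80</value>
</property>
```

Raising this value delays the shuffle but frees cluster slots for maps; lowering it overlaps copying with the remaining map work.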
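The "sort, then reduce one key set at a time" step in point 2 can be sketched as follows (the fetched pairs and the reducer are made up for illustration; Hadoop really merge-sorts spill files on disk rather than sorting a list in memory):

```python
# Reduce side for a single partition: sort the fetched intermediate
# pairs, then hand the reducer one key group at a time.
from itertools import groupby
from operator import itemgetter

# pairs this reducer pulled from every map task's matching partition
fetched = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("a", 1)]

def reduce_fn(key, values):
    return key, sum(values)  # word-count style reducer

fetched.sort(key=itemgetter(0))  # sort phase
results = [reduce_fn(k, [v for _, v in grp])
           for k, grp in groupby(fetched, key=itemgetter(0))]
print(results)  # [('a', 3), ('b', 2)]
```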

It's also important to understand the spread and variation of your intermediate keys, since each key is hashed and taken modulo the number of reducers (if using the default HashPartitioner) to determine which reduce partition should process it. Say you had an even number of reducer tasks (10), and output keys that always hashed to an even number - then the modulo of those hash values and 10 would always be even, meaning that the odd-numbered reducers would never process any data.
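The skew described above is easy to demonstrate with a contrived hash function standing in for HashPartitioner's (all names here are hypothetical):

```python
# If every key hashes to an even number, only even-numbered partitions
# receive data with 10 reducers; odd-numbered reducers sit idle.
R = 10

def even_hash(key):
    return 2 * key  # contrived: always produces an even hash

loads = [0] * R  # pairs routed to each reduce partition
for k in range(100):
    loads[even_hash(k) % R] += 1

print(loads)  # [20, 0, 20, 0, 20, 0, 20, 0, 20, 0]
assert all(loads[i] == 0 for i in range(1, R, 2))  # odd reducers idle
```

In practice this is why a custom Partitioner (or a better-distributed key) is worth considering when keys are structured, e.g. always ending in the same digit.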
