MapReduce - how do I calculate relative values (average, top k and so on)?


Question

I'm looking for a way to calculate "global" or "relative" values during a MapReduce process - an average, a sum, the top k, and so on. Say I have a list of workers, with their IDs associated with their salaries (and a bunch of other stuff). At some stage of the processing, I'd like to know which workers earn the top 10% of salaries. For that I need some "global" view of the values, which I can't figure out how to get.

If I have all values sent into a single reducer, it has that global view, but then I lose concurrency, and it seems awkward. Is there a better way?

(The framework I'd like to use is Google's, but I'm trying to figure out the technique - no framework-specific tricks please)

Solution

I would do something like this:

  1. The mapper will use a UUID as part of the key, created in the setup() method of the mapper. The mapper emits, as the key, the UUID appended with either 0 or the salary. The mapper accumulates the count and total. (See the mapper sketch after this list.)

  2. In the cleanup() method, the mapper emits the UUID appended with 0 as the key and the count and total as the value. In the map() method, the mapper emits the UUID appended with the salary as the key and the salary as the value.

  3. Since the keys are sorted, the first call to the combiner will have the count and total as the value. The combiner could store them as class members. We could also find out what 10% of the total count is and save that as a class member as well (call it top). We initialize a list and save it as a class member. (See the combiner sketch after this list.)

  4. Subsequent calls to the combiner will contain the salaries as the values, arriving in sorted order. We add each value to the list and at the same time increment a counter. When the counter reaches the value top, we don't store any more values in our list. We ignore the values in the rest of the combiner calls.

  5. In the combiner cleanup(), we emit. The combiner will emit only the UUID as the key. The value will contain the count and total, followed by the top 10% of the values. So the output of the combiner will have partial results, based on the subset of the data that passed through the mapper.

  6. The reducer will be called as many times as the number of mappers in this case, because each mapper/combiner emits only one key.

  7. The reducer will accumulate the counts, totals and the top 10% values in the reduce() method. In the cleanup() method, the average is calculated. The top 10% is also calculated in the cleanup() method, from the aggregation of the top 10% lists arriving in each call to the reducer. This is basically a merge sort. (See the reducer sketch after this list.)

  8. The reducer cleanup() method could do multiple emits, so that the average is in the first row, followed by the top 10% of salaries in the subsequent rows.

  9. Finally, to ensure that the final aggregate statistics are global, you have to set the number of reducers to one.

  10. Since there is data accumulation and sorting in the reducer, although only on partial data sets, there may be memory issues.
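
To make steps 1-2 concrete, here is a minimal sketch of the mapper, assuming Hadoop's Java API (which the answer's setup()/map()/cleanup() wording suggests). The input layout "workerId<TAB>salary", the class name, and the MAX_SALARY constant are hypothetical. One extra assumption of mine: the salary part of the key is encoded as MAX_SALARY minus the salary, zero-padded, so that within a mapper's UUID the highest salaries sort first while the "0" summary record still sorts before all of them.

```java
import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical input: one worker per line, "workerId<TAB>salary".
public class SalaryMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Assumed upper bound so higher salaries sort first when the key is
    // built from (MAX_SALARY - salary), zero-padded.
    private static final long MAX_SALARY = 999_999_999_999L;

    private String uuid;     // one UUID per mapper task (step 1)
    private long count = 0;  // records seen by this mapper
    private long total = 0;  // sum of salaries seen by this mapper

    @Override
    protected void setup(Context context) {
        uuid = UUID.randomUUID().toString();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        long salary = Long.parseLong(line.toString().split("\t")[1]);
        count++;
        total += salary;
        // Key = UUID + descending-encoded salary; value = the raw salary (step 2).
        String sortKey = String.format("%012d", MAX_SALARY - salary);
        context.write(new Text(uuid + ":" + sortKey), new Text(Long.toString(salary)));
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Summary record: "UUID:0" sorts before every salary key of this mapper,
        // and carries the count and total (step 2).
        context.write(new Text(uuid + ":0"), new Text(count + "," + total));
    }
}
```

The UUID keeps each mapper's records together as one sorted block, which is what the combiner below relies on.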

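A matching sketch of the combiner described in steps 3-5, under the same assumptions. Note that the answer treats the combiner as seeing all of one mapper's sorted output exactly once and as keeping state in class members across reduce() calls; Hadoop does not strictly guarantee this (combiners can run per spill), so take it as an illustration of the idea rather than production code. The class name is hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner: collapses one mapper's output into a single partial result
// keyed by that mapper's UUID (steps 3-5).
public class SalaryCombiner extends Reducer<Text, Text, Text, Text> {

    private String uuid;
    private long count = 0;
    private long total = 0;
    private long top = 0;  // 10% of this mapper's record count
    private final List<Long> topSalaries = new ArrayList<>();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) {
        String[] parts = key.toString().split(":");
        uuid = parts[0];
        if (parts[1].equals("0")) {
            // First group per UUID: the mapper's summary record (step 3).
            String[] countAndTotal = values.iterator().next().toString().split(",");
            count = Long.parseLong(countAndTotal[0]);
            total = Long.parseLong(countAndTotal[1]);
            top = Math.max(1, count / 10);
        } else {
            // Salary groups arrive highest-first (thanks to the key encoding);
            // keep only the first `top` of them, ignore the rest (step 4).
            for (Text v : values) {
                if (topSalaries.size() < top) {
                    topSalaries.add(Long.parseLong(v.toString()));
                }
            }
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // One partial result per mapper: count, total, then its top-10% salaries (step 5).
        StringBuilder out = new StringBuilder(count + "," + total);
        for (long s : topSalaries) {
            out.append(',').append(s);
        }
        context.write(new Text(uuid), new Text(out.toString()));
    }
}
```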
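
Finally, a sketch of the single reducer from steps 6-9: it accumulates the partial counts, totals and top-10% lists in reduce(), then computes the average and the merged global top 10% in cleanup(). The driver (not shown) would register these hypothetical classes and, per step 9, force a single reduce task, e.g. job.setCombinerClass(SalaryCombiner.class) and job.setNumReduceTasks(1).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Single reducer: merges the per-mapper partial results into global statistics.
public class SalaryReducer extends Reducer<Text, Text, Text, Text> {

    private long count = 0;
    private long total = 0;
    private final List<Long> merged = new ArrayList<>();  // union of partial top-10% lists

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) {
        // One value per mapper/combiner: "count,total,salary,salary,..." (step 7).
        for (Text v : values) {
            String[] parts = v.toString().split(",");
            count += Long.parseLong(parts[0]);
            total += Long.parseLong(parts[1]);
            for (int i = 2; i < parts.length; i++) {
                merged.add(Long.parseLong(parts[i]));
            }
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Average first (step 8).
        double average = count == 0 ? 0.0 : (double) total / count;
        context.write(new Text("average"), new Text(Double.toString(average)));

        // Then the global top 10%: sort the merged partial lists descending and
        // keep the first 10% of the global count (the "merge sort" of step 7).
        Collections.sort(merged, Collections.reverseOrder());
        long top = Math.max(1, count / 10);
        for (int i = 0; i < top && i < merged.size(); i++) {
            context.write(new Text("top"), new Text(Long.toString(merged.get(i))));
        }
    }
}
```

Because only one reduce task runs, the accumulation and sort happen on a single node, which is exactly the memory caveat raised in step 10.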