Computing median in map reduce


Question


Can someone explain, with an example, how the median/quantiles are computed in map reduce?

My understanding of DataFu's median is that the 'n' mappers sort the data and send it to one reducer, which is responsible for sorting all the data from the n mappers and finding the median (middle value). Is my understanding correct?

If so, does this approach scale to massive amounts of data? I can clearly see the single reducer struggling to do the final task. Thanks

Solution

Trying to find the median (middle number) in a series is going to require that a single reducer is passed the entire range of numbers to determine which is the 'middle' value.

Depending on the range and uniqueness of values in your input set, you could introduce a combiner that outputs the frequency of each value, reducing the number of map outputs sent to your single reducer. Your reducer can then consume the sorted value/frequency pairs to identify the median.
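As a rough illustration of the combiner idea, here is a minimal Python sketch that simulates it in memory rather than on Hadoop. The names `combine` and `reduce_median` and the sample splits are made up for this example; a real job would emit the (value, count) pairs from a combiner class and rely on the shuffle to deliver them in sorted key order.

```python
from collections import Counter

def combine(split):
    """Combiner stand-in (hypothetical): collapse one mapper's values into value -> count."""
    return Counter(split)

def reduce_median(partials):
    """Single-reducer stand-in: merge the counts, then walk them in sorted key order."""
    totals = Counter()
    for partial in partials:
        totals.update(partial)
    n = sum(totals.values())
    if n == 0:
        raise ValueError("median of empty input")
    # 0-based ranks of the middle element(s); they coincide when n is odd.
    lo_rank, hi_rank = (n - 1) // 2, n // 2
    lo_val = hi_val = None
    seen = 0
    for value in sorted(totals):
        seen += totals[value]
        if lo_val is None and seen > lo_rank:
            lo_val = value
        if seen > hi_rank:
            hi_val = value
            break
    return (lo_val + hi_val) / 2

# Three made-up mapper splits; the sorted union is 1,1,3,3,3,5,5,5,9.
splits = [[5, 1, 5, 3], [3, 3, 9], [1, 5]]
print(reduce_median(combine(s) for s in splits))  # 3.0
```

The reducer only ever holds one entry per distinct value, so the saving depends on how many duplicates the input contains; with mostly unique values it degenerates to the plain single-reducer approach.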

Another way you could scale this (again, if you know the range and rough distribution of values) is to use a custom partitioner that distributes the keys by range buckets (0-99 go to reducer 0, 100-199 to reducer 1, and so on). This will, however, require a secondary job to examine the reducer outputs and perform the final median calculation: knowing, for example, the number of keys in each reducer, you can calculate which reducer output will contain the median, and at which offset.
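The bucket approach can be sketched the same way. This is an in-memory simulation under stated assumptions, not Hadoop code: `partition`, `locate_median`, the bucket width of 100, the four reducers, and the sample values are all hypothetical, and for an even number of records it returns the lower of the two middle values.

```python
def partition(value, width=100):
    """Hypothetical range partitioner: value // width picks the reducer index."""
    return value // width

def locate_median(bucket_counts):
    """Secondary step: from per-reducer record counts (in bucket order),
    find which reducer output holds the lower median and at which offset."""
    n = sum(bucket_counts)
    if n == 0:
        raise ValueError("no records")
    rank = (n - 1) // 2  # 0-based global rank of the lower median
    for bucket, count in enumerate(bucket_counts):
        if rank < count:
            return bucket, rank
        rank -= count  # skip past this bucket's records

# Made-up input, assumed to lie in the known range 0-399.
values = [250, 12, 180, 99, 101, 305, 42]
num_reducers = 4
buckets = [[] for _ in range(num_reducers)]
for v in values:
    buckets[partition(v)].append(v)
for b in buckets:
    b.sort()  # stands in for the sorted key order each reducer receives

counts = [len(b) for b in buckets]
bucket, offset = locate_median(counts)
print(bucket, offset, buckets[bucket][offset])  # 1 0 101
```

Only the per-reducer counts need to be gathered centrally; the median itself is then read from a single reducer's output. How evenly the work spreads depends on how well the bucket boundaries match the actual distribution of values.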
