使用Map Reduce实施储层采样 [英] Implementing Reservoir Sampling using Map Reduce

查看:63
本文介绍了使用Map Reduce实施储层采样的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

此链接" http://had00b.blogspot.com /2013/07/random-subset-in-mapreduce.html "讨论了如何使用地图缩减框架实施储层采样.我觉得他们的解决方案很复杂,下面的更简单的方法也行得通.

问题: 在给定大量样本的情况下,生成大小为k的集合,以使每个样本在集合中存在的可能性相同.

建议的解决方案:

  1. 映射操作:对于每个输入数字n,输出(i,n),其中i在0到k-1的范围内随机选择.
  2. 减少操作:在所有具有相同键的数字中,随机选择一个.

声明: 在k个大小的集合中,任何数字的概率为k/n(其中n是样本总数)

证明直觉:

< p>由于映射操作将每个输入样本随机分配给存储桶编号i(0< = i< = k-1),因此每个存储桶的大小将为n/k. 现在,每个数字仅存在于一个存储桶中,假设存储桶i.在铲斗i的归约运算中被拾取的概率为1/(n/k)= k/n

无论解决方案看起来是否正确,我都会对我的解决方案有任何其他想法.

解决方案

您的论点有一个小缺陷.您的算法可能不会返回大小为k的样本,因为可能不会发生任何元素映射到特定键的情况.在极端情况下(即使机会很小),可能会发生所有输入元素仅映射到一个键的情况,在这种情况下,您的算法仅返回一个元素.

缺少"特定键的事件的概率为((k-1)/k)^ n =(1-1/k)^ n,其概率近似为(使用泰勒近似法)e ^ {-n/k }.如果k远小于n,则可以忽略不计,但是如果k与n成正比,例如k = n/2,则此不良事件实际上以恒定的概率发生.

This link "http://had00b.blogspot.com/2013/07/random-subset-in-mapreduce.html" talks about how one can implement reservoir sampling using map reduce framework. I feel their solution is complicated and the following simpler approach would work.

Problem: Given very large number of samples, generate a set of size k such that each sample has equal probability of being present in the set.

Proposed solution:

  1. Map operation: For each input number n, output (i, n) where i is randomly chosen in range 0 to k-1.
  2. Reduce operation: Among all numbers with same key, pick one randomly.

Claim: Probability of any number being in set of k size is k/n (where n is total number of samples)

Proof intuition:

Since map operation randomly assigned each input sample to bucket number i (0 <= i <= k-1), size of each bucket would be n/k. Now each number is present only in one bucket, suppose bucket i. The probability that it gets picked in reduce operation for bucket i is 1/(n/k) = k/n

I would appreciate any second thoughts on my solution, whether it seems correct or not.

解决方案

There's a small flaw in your argument. Your algorithm may not return a sample of size k, as it may happen that none of the element gets mapped to a specific key. In the extreme case (even if it has small chance), it may happen that all the input elements get mapped to only one key, in which case your algorithm returns only one element.

The event of "missing" a specific key has probability ((k-1)/k)^n = (1-1/k)^n which is approximately (using Taylor approximation) e^{-n/k}. This is negligible if k is much smaller than n, but if k is proportional to n, say k=n/2, then this bad event actually happens with constant probability.

这篇关于使用Map Reduce实施储层采样的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆