How to get a sample with an exact sample size in Spark RDD?

Question

Why does the rdd.sample() function on Spark RDD return a different number of elements even though the fraction parameter is the same? For example, if my code is like below:

val a = sc.parallelize(1 to 10000, 3)
a.sample(false, 0.1).count

Every time I run the second line of the code it returns a different number not equal to 1000. Actually I expect to see 1000 every time although the 1000 elements might be different. Can anyone tell me how I can get a sample with the sample size exactly equal to 1000? Thank you very much.

Answer

If you want an exact sample, try doing

a.takeSample(false, 1000)

But note that this returns an Array and not an RDD.
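
The fixed-size guarantee of takeSample can be illustrated without a cluster. Below is a minimal plain-Scala sketch of the same semantics, exactly k distinct elements drawn without replacement; note that shuffle-and-take is just an illustration, not Spark's actual implementation:

```scala
import scala.util.Random

object ExactSample {
  // Sampling without replacement with an exact size, as takeSample(false, k)
  // guarantees: shuffle the collection, then keep the first k elements.
  def takeSampleNoReplacement[T](data: Seq[T], k: Int, rng: Random): Seq[T] =
    rng.shuffle(data).take(k)

  def main(args: Array[String]): Unit = {
    val sample = takeSampleNoReplacement(1 to 10000, 1000, new Random())
    println(sample.size)  // always exactly 1000
  }
}
```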

As for why a.sample(false, 0.1) doesn't return the same sample size every time: that's because Spark internally uses Bernoulli sampling to take the sample. The fraction argument doesn't represent a fraction of the actual size of the RDD. It represents the probability of each element in the population being selected for the sample, and as Wikipedia says:

Because each element of the population is considered separately for the sample, the sample size is not fixed but rather follows a binomial distribution.

And that essentially means that the number doesn't remain fixed.
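
The binomial behavior is easy to reproduce outside Spark. The following plain-Scala sketch applies the same per-element Bernoulli test that sample performs, so the resulting counts cluster around n * p but vary from run to run:

```scala
import scala.util.Random

object BernoulliDemo {
  // Bernoulli sampling: each of the n elements is kept independently with
  // probability p, so the sample size follows a Binomial(n, p) distribution.
  def bernoulliSample(n: Int, p: Double, rng: Random): Int =
    (1 to n).count(_ => rng.nextDouble() < p)

  def main(args: Array[String]): Unit = {
    val sizes = (1 to 5).map(_ => bernoulliSample(10000, 0.1, new Random()))
    println(sizes.mkString(", "))  // values near 1000, but rarely exactly 1000
  }
}
```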

If you set the first argument to true, then it will use something called Poisson sampling, which also results in a non-deterministic resultant sample size.

Update

If you want to stick with the sample method, you can specify a larger probability for the fraction parameter and then call take, as in:

a.sample(false, 0.2).take(1000)

This will result in a sample size of 1000 most of the time, but not necessarily always; it works only if the population is large enough.
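
The oversample-then-truncate idea can be sketched in plain Scala. With p = 0.2 over 10000 elements the Bernoulli draw keeps about 2000 elements on average, so truncating to 1000 almost always succeeds; the only failure mode is the (vanishingly rare, for a population this large) case where fewer than 1000 elements survive the draw:

```scala
import scala.util.Random

object OversampleTake {
  // Mirror of a.sample(false, p).take(exact): Bernoulli-sample with a
  // probability well above exact / data.size, then truncate to the exact size.
  def sampleExact(data: Seq[Int], p: Double, exact: Int, rng: Random): Seq[Int] =
    data.filter(_ => rng.nextDouble() < p).take(exact)

  def main(args: Array[String]): Unit = {
    val sample = sampleExact(1 to 10000, 0.2, 1000, new Random())
    println(sample.size)  // 1000 with overwhelming probability
  }
}
```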
