How do simple random sampling and the DataFrame sample function work in Apache Spark (Scala)?
Problem description
Q1. I am trying to get a simple random sample from a Spark DataFrame (13 rows) using the sample function with the parameters withReplacement = false, fraction = 0.6, but it gives me samples of different sizes every time I run it, though it works fine when I set the third parameter (seed). Why is that?
Q2. How is the sample obtained after random number generation?
Thanks in advance
How is the sample obtained after random number generation?
Depending on the fraction you want to sample, there are two different algorithms. You can check Justin Pihony's answer to SPARK Is sample method on Dataframes uniform sampling?
it gives me samples of different sizes every time I run it, though it works fine when I set the third parameter (seed). Why so?
If fraction is above RandomSampler.defaultMaxGapSamplingFraction, sampling is done by a simple filter:
items.filter { _ => rng.nextDouble() <= fraction }
otherwise, simplifying things a little bit, it repeatedly calls the drop method with random integers and takes the next item.
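To make the two strategies concrete, here is a simplified plain-Scala sketch of both, using scala.util.Random over a local collection rather than Spark's internal samplers; the names (bernoulliSample, gapSample) and the gap-drawing details are stand-ins, not the real GapSamplingIterator implementation:

```scala
import scala.util.Random

object SamplingSketch {
  // Strategy 1: Bernoulli filter -- keep each item independently with
  // probability `fraction` (the simple-filter branch described above).
  def bernoulliSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.filter(_ => rng.nextDouble() <= fraction)

  // Strategy 2: gap sampling -- instead of testing every item, draw a
  // geometrically distributed gap, drop that many items, take the next one.
  // Assumes 0 < fraction < 1. A simplified stand-in for Spark's
  // GapSamplingIterator, which is more efficient for small fractions.
  def gapSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] = {
    val out = scala.collection.mutable.Buffer.empty[T]
    var it = items.iterator
    var more = true
    while (more) {
      // Gap ~ Geometric(fraction): items to skip before the next keep.
      val gap = (math.log(rng.nextDouble()) / math.log(1.0 - fraction)).toInt
      it = it.drop(gap)
      if (it.hasNext) out += it.next() else more = false
    }
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 1000
    val a = bernoulliSample(data, 0.1, new Random(42))
    val b = gapSample(data, 0.1, new Random(42))
    // Both strategies keep roughly fraction * n elements on average (~100 here).
    println(s"bernoulli kept ${a.size}, gap kept ${b.size}")
  }
}
```

Both functions select each element with probability fraction, which is why the resulting size is random rather than fixed; the gap variant just avoids drawing a random number per element.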
Keeping that in mind, it should be obvious that the number of returned elements is random, with a mean (assuming there is nothing wrong with GapSamplingIterator) equal to fraction * rdd.count. If you set a seed, you get the same sequence of random numbers and, as a consequence, the same elements are included in the sample.
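The seed behaviour can be reproduced without Spark at all. The sketch below applies the same Bernoulli-filter logic with scala.util.Random to a 13-element range (matching the 13-row DataFrame in the question); it is an illustration of the mechanism, not Spark's actual sampler:

```scala
import scala.util.Random

object SeedDemo {
  // Mimics sample(withReplacement = false, fraction) on a local collection:
  // each element is kept independently with probability `fraction`.
  def sample(items: Seq[Int], fraction: Double, seed: Long): Seq[Int] = {
    val rng = new Random(seed)
    items.filter(_ => rng.nextDouble() <= fraction)
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 13
    // Different seeds generally yield samples of different sizes and contents,
    // just like running DataFrame.sample repeatedly without a seed.
    println(sample(data, 0.6, seed = 1))
    println(sample(data, 0.6, seed = 2))
    // A fixed seed yields the exact same sample on every run.
    assert(sample(data, 0.6, seed = 99) == sample(data, 0.6, seed = 99))
  }
}
```

With only 13 rows and fraction = 0.6 the size fluctuation is quite visible (the standard deviation of a Binomial(13, 0.6) count is about 1.8), which is why the question's runs differed so noticeably until a seed was fixed.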