How do simple random sampling and the DataFrame sample function work in Apache Spark (Scala)?


Question


Q1. I am trying to get a simple random sample out of a Spark DataFrame (13 rows) using the sample function with the parameters withReplacement: false and fraction: 0.6, but it gives me samples of different sizes every time I run it, though it works fine when I set the third parameter (seed). Why is that?

Q2. How is the sample obtained after random number generation?

Thanks in advance

Answer

How is the sample obtained after random number generation?

Depending on the fraction you want to sample, one of two different algorithms is used. You can check Justin Pihony's answer to "SPARK Is sample method on Dataframes uniform sampling?"

it gives me samples of different sizes every time I run it, though it works fine when I set the third parameter (seed). Why is that?

If the fraction is above RandomSampler.defaultMaxGapSamplingFraction, sampling is done by a simple filter:

items.filter { _ => rng.nextDouble() <= fraction }

Otherwise, simplifying things a little, it repeatedly calls the drop method with random integers and takes the next item.
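The two strategies described above can be sketched in plain Scala without Spark. This is a minimal illustration, not Spark's source: the object name SamplingSketch, the method names filterSample and gapSample, and the 1e-12 guard are assumptions made for the example, and the geometric gap formula is one standard way to pick the "random integers" that the drop step uses.

```scala
import scala.util.Random

// Hypothetical names; a simplified sketch, not Spark's implementation.
object SamplingSketch {

  // Strategy 1: keep each item independently with probability `fraction`.
  def filterSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.filter(_ => rng.nextDouble() <= fraction)

  // Strategy 2: gap sampling. Draw how many items to skip from a geometric
  // distribution, drop that many, then keep the next item, and repeat.
  def gapSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] = {
    require(fraction > 0.0 && fraction < 1.0, "fraction must be in (0, 1)")
    val lnq = math.log1p(-fraction) // log(1 - fraction), always negative here
    val it = items.iterator
    val out = Seq.newBuilder[T]
    var done = false
    while (!done) {
      val u = math.max(rng.nextDouble(), 1e-12) // guard against log(0)
      val gap = (math.log(u) / lnq).toInt       // number of items to skip
      var skipped = 0
      while (skipped < gap && it.hasNext) { it.next(); skipped += 1 }
      if (it.hasNext) out += it.next() else done = true
    }
    out.result()
  }
}
```

Both methods include each item with probability fraction; gap sampling simply draws far fewer random numbers when the fraction is small, because it needs one draw per kept item rather than one per input item.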

With that in mind, it should be clear that the number of returned elements is itself random: the fraction is a per-row inclusion probability, not an exact sample size. The mean, assuming there is nothing wrong with GapSamplingIterator, is equal to fraction * rdd.count. If you set a seed, you get the same sequence of random numbers and, as a consequence, the same elements are included in the sample.
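To make this concrete for the 13-row case from the question, here is a small plain-Scala sketch that mimics the filter-based path with scala.util.Random. It does not call Spark; the object name SampleSizeDemo, the helper names, and the seed values are all made up for illustration.

```scala
import scala.util.Random

// Hypothetical demo; mimics the Bernoulli filter, does not use Spark.
object SampleSizeDemo {
  val rows: Seq[Int] = 1 to 13 // stand-in for the 13-row DataFrame
  val fraction: Double = 0.6

  // Keep each row independently with probability `fraction`.
  def sample(rng: Random): Seq[Int] =
    rows.filter(_ => rng.nextDouble() <= fraction)

  // Average sample size over many trials drawn from one seeded generator.
  def meanSize(trials: Int, seed: Long): Double = {
    val rng = new Random(seed)
    Seq.fill(trials)(sample(rng).size).sum.toDouble / trials
  }

  def main(args: Array[String]): Unit = {
    // Unseeded: each run uses a fresh random sequence, so sizes vary.
    val sizes = Seq.fill(5)(sample(new Random()).size)
    println(s"unseeded sample sizes: $sizes")

    // Seeded: the same seed replays the same random numbers and rows.
    val a = sample(new Random(42L))
    val b = sample(new Random(42L))
    println(s"seeded samples identical: ${a == b}")

    // The average size approaches fraction * count.
    val mean = meanSize(10000, 7L)
    println(f"mean size $mean%.2f, expected ${fraction * rows.size}%.2f")
  }
}
```

In Spark itself the corresponding call would be df.sample(withReplacement = false, fraction = 0.6, seed = 42L), where df is a hypothetical DataFrame; even with a seed, the sample size is whatever that random sequence happens to produce, not exactly fraction * count.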
