How do simple random sampling and dataframe SAMPLE function work in Apache Spark (Scala)?


Question


Q1. I am trying to get a simple random sample out of a Spark dataframe (13 rows) using the sample function with parameters withReplacement: false, fraction: 0.6, but it gives me samples of different sizes every time I run it, though it works fine when I set the third parameter (seed). Why so?
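For reference, a minimal sketch of the call in question (assuming a SparkSession named spark; the 13-row DataFrame and column name are illustrative):

// Hypothetical 13-row DataFrame standing in for the one in the question.
val df = spark.range(13).toDF("id")

// Without a seed, the sample size varies from run to run:
println(df.sample(withReplacement = false, fraction = 0.6).count())  // e.g. 7
println(df.sample(withReplacement = false, fraction = 0.6).count())  // e.g. 9

// With a fixed seed, the same rows come back every time:
println(df.sample(withReplacement = false, fraction = 0.6, seed = 42L).count())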


Q2. How is the sample obtained after random number generation?

Thanks in advance

Answer


How is the sample obtained after random number generation?


Depending on the fraction you want to sample, there are two different algorithms. You can check Justin Pihony's answer to "SPARK Is sample method on Dataframes uniform sampling?"


it gives me samples of different sizes every time I run it, though it works fine when I set the third parameter (seed). Why so?


If the fraction is above RandomSampler.defaultMaxGapSamplingFraction, sampling is done by a simple filter:

items.filter { _ => rng.nextDouble() <= fraction }
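As a self-contained illustration (plain Scala with scala.util.Random, not the actual Spark internals), filter-based sampling like this accepts each item independently with probability fraction, which is why the sample size varies unless the RNG is seeded:

import scala.util.Random

// Hypothetical stand-in for the 13-row dataset in the question.
val items = (1 to 13).toList
val fraction = 0.6

// Each item is kept independently with probability `fraction`.
def bernoulliSample(seed: Option[Long]): List[Int] = {
  val rng = seed.map(new Random(_)).getOrElse(new Random())
  items.filter(_ => rng.nextDouble() <= fraction)
}

println(bernoulliSample(None).size)       // varies from run to run
println(bernoulliSample(Some(42L)).size)  // identical on every run with the same seed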


Otherwise, simplifying things a little bit, it repeatedly calls the drop method with random integers and takes the next item.
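A rough sketch of that gap-sampling idea, written here as plain Scala over an Iterator (a simplification for illustration, not Spark's actual GapSamplingIterator):

import scala.util.Random
import scala.collection.mutable.ListBuffer

// Skip a geometrically distributed number of items, then take the next one.
// With gap ~ floor(log(u) / log(1 - fraction)), each item ends up accepted
// with probability `fraction` on average.
def gapSample[T](items: Iterator[T], fraction: Double, rng: Random): List[T] = {
  val lnq = math.log(1.0 - fraction)
  val out = ListBuffer.empty[T]
  var more = true
  while (more) {
    var gap = (math.log(rng.nextDouble()) / lnq).toInt
    while (gap > 0 && items.hasNext) { items.next(); gap -= 1 }
    if (items.hasNext) out += items.next() else more = false
  }
  out.toList
}

println(gapSample((1 to 13).iterator, 0.6, new Random()))    // size varies per run
println(gapSample((1 to 13).iterator, 0.6, new Random(42)))  // fixed for a given seed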


Keeping that in mind, it should be obvious that the number of returned elements is random, with a mean (assuming there is nothing wrong with GapSamplingIterator) equal to fraction * rdd.count. If you set the seed, you get the same sequence of random numbers and, as a consequence, the same elements are included in the sample.
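A quick way to check both claims (mean sample size near fraction * count, identical rows for a fixed seed), again assuming a SparkSession named spark and a hypothetical 13-row DataFrame:

// Hypothetical 13-row DataFrame, as in the question.
val data = spark.range(13).toDF("id")

// Mean size over many runs should be close to 0.6 * 13 = 7.8.
val sizes = (1 to 100).map(_ => data.sample(withReplacement = false, fraction = 0.6).count())
println(sizes.sum / 100.0)

// Same seed => same random sequence => same rows in the sample.
val first  = data.sample(withReplacement = false, fraction = 0.6, seed = 1L).collect().toSeq
val second = data.sample(withReplacement = false, fraction = 0.6, seed = 1L).collect().toSeq
println(first == second)  // true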
