SPARK: Is the sample method on DataFrames uniform sampling?

Question

I want to randomly select a certain number of rows from a DataFrame. I know the sample method does this, but I need the sampling to be uniform. Is the sample method on Spark DataFrames uniform or not?

Thanks

Recommended answer

There are a few code paths here:

  • If withReplacement = false && fraction > .4, it uses a souped-up random number generator (rng.nextDouble() <= fraction) and lets that do the work. Each row is kept or dropped by its own draw, so this seems like it would be pretty uniform.
  • If withReplacement = false && fraction <= .4, it uses a more complex algorithm (GapSamplingIterator). At a glance, it looks like it should be uniform as well.
  • If withReplacement = true, it does close to the same thing, except that by the looks of it rows can be duplicated, so this looks to me like it would not be as uniform as the first two.
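To make the first two paths concrete, here is a minimal plain-Python sketch. It is an illustration only and not Spark's code: the names bernoulli_sample, gap_sample, and inclusion_rates are hypothetical, and the geometric-skip formula is the textbook construction, assumed here to resemble what GapSamplingIterator does. The helper estimates how often each position is kept, which should be close to fraction for every position if the sampling is uniform.

```python
import math
import random


def bernoulli_sample(items, fraction, rng):
    """Keep each item when an independent uniform draw is <= fraction."""
    return [x for x in items if rng.random() <= fraction]


def gap_sample(items, fraction, rng):
    """Keep items by drawing the number of items to skip before the next kept one.

    Assumes 0 < fraction < 1. The skip length is geometric, so each item is
    still kept with probability `fraction`, but one random draw is made per
    kept item instead of one per item.
    """
    log_q = math.log(1.0 - fraction)
    kept = []
    i = -1
    while True:
        u = 1.0 - rng.random()          # uniform in (0, 1], avoids log(0)
        gap = int(math.log(u) / log_q)  # items skipped before the next kept one
        i += gap + 1
        if i >= len(items):
            return kept
        kept.append(items[i])


def inclusion_rates(sampler, n_items, fraction, trials, seed):
    """Estimate, per position, how often the sampler keeps that position."""
    rng = random.Random(seed)
    items = list(range(n_items))
    counts = [0] * n_items
    for _ in range(trials):
        for x in sampler(items, fraction, rng):
            counts[x] += 1
    return [c / trials for c in counts]


if __name__ == "__main__":
    # Every position should be kept at a rate close to the requested fraction.
    print(inclusion_rates(bernoulli_sample, 10, 0.6, 20000, seed=1))
    print(inclusion_rates(gap_sample, 10, 0.3, 20000, seed=2))
```

Note that in both sketches the size of the result varies from run to run; only the per-row probability is fixed, so a fraction-based sample does not return an exact number of rows.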
