Spark: Is the sample method on DataFrames uniform sampling?

Question

I want to randomly select a number of rows from a DataFrame. I know the sample method does this, but I need the selection to be uniform, with every row equally likely to be picked. Is Spark's sample method on DataFrames uniform or not?

Thanks.

Recommended answer

There are a few code paths here:


  • If withReplacement = false && fraction > 0.4, it uses a souped-up random number generator (rng.nextDouble() <= fraction) and lets that do the work. This seems like it would be pretty uniform.
  • If withReplacement = false && fraction <= 0.4, it uses a more complex algorithm, GapSamplingIterator (https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala#L217). At a glance, it looks like it should be uniform as well.
  • If withReplacement = true, it does close to the same thing (https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala#L199), except that, by the looks of it, it can emit the same row more than once. To me that looks like it would not be as uniform as the first two.
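
To get a feel for the three code paths above, here is a minimal pure-Python sketch that simulates each idea and measures how often every row index is picked. It is not Spark's implementation: the function names (bernoulli_sample, gap_sample, poisson_sample, inclusion_rates), the eps guard, and the use of a Poisson draw for the with-replacement case are all assumptions made for illustration.

```python
import math
import random


def bernoulli_sample(n_rows, fraction, rng):
    """Keep each row independently when a uniform draw is <= fraction."""
    return [i for i in range(n_rows) if rng.random() <= fraction]


def gap_sample(n_rows, fraction, rng, eps=5e-11):
    """Skip ahead by geometrically distributed gaps instead of testing every row."""
    ln_q = math.log1p(-fraction)  # log(1 - fraction), negative for 0 < fraction < 1
    kept = []
    i = -1
    while True:
        u = max(rng.random(), eps)  # guard against log(0)
        gap = int(math.log(u) / ln_q)  # number of rows skipped before the next kept row
        i += 1 + gap
        if i >= n_rows:
            return kept
        kept.append(i)


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def poisson_sample(n_rows, fraction, rng):
    """With replacement: each row is emitted k times, where k ~ Poisson(fraction)."""
    out = []
    for i in range(n_rows):
        out.extend([i] * poisson_draw(fraction, rng))
    return out


def inclusion_rates(sampler, n_rows, fraction, trials, seed):
    """Average number of times each row index is emitted per trial."""
    rng = random.Random(seed)
    counts = [0] * n_rows
    for _ in range(trials):
        for i in sampler(n_rows, fraction, rng):
            counts[i] += 1
    return [c / trials for c in counts]


if __name__ == "__main__":
    for sampler in (bernoulli_sample, gap_sample, poisson_sample):
        rates = inclusion_rates(sampler, n_rows=10, fraction=0.3, trials=5000, seed=1)
        print(sampler.__name__, [round(r, 2) for r in rates])
```

In this sketch every row index ends up with an average rate close to fraction under all three samplers. The with-replacement variant differs only in that a row can be emitted more than once within a single trial.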
