The sample method of Spark RDD does not work as expected


Question

I am trying out the "sample" method of RDD on Spark 1.6.1:

scala> val nu = sc.parallelize(1 to 10)
scala> val sp = nu.sample(true, 0.2)
scala> sp.collect.foreach(println(_))
3
8

scala> val sp2 = nu.sample(true, 0.2)
scala> sp2.collect.foreach(println(_))
2
4
7
8
10

I cannot understand why sp2 contains 2, 4, 7, 8, 10. I expected only two numbers to be printed. Is anything wrong?

Recommended answer

Elaborating on the previous answer: the documentation (scroll down to sample) describes the fraction parameter as follows (emphasis mine):

fraction: expected size of the sample as a fraction of this RDD's size
  without replacement: probability that each element is chosen; fraction must be [0, 1]
  with replacement: expected number of times each element is chosen; fraction must be >= 0

'Expected' can have several meanings depending on the context, but one meaning it certainly does not have is 'exact'. With 10 elements and a fraction of 0.2, the sample size is 2 on average, but any individual sample may be smaller or larger, which is why the number of elements printed varies from call to call.
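To make the 'expected, not exact' point concrete, here is a minimal plain-Scala sketch that needs no Spark at all. It only simulates the rule quoted above for sampling without replacement (each element is kept independently with probability fraction); it is not Spark's actual implementation, and the helper name bernoulliSample and the seed 42 are made up for illustration.

```scala
import scala.util.Random

// Hypothetical helper, not part of Spark: keep each element independently
// with probability `fraction`, as the quoted documentation describes for
// sampling without replacement.
def bernoulliSample(data: Seq[Int], fraction: Double, rng: Random): Seq[Int] =
  data.filter(_ => rng.nextDouble() < fraction)

val data = 1 to 10
val fraction = 0.2

// One generator with an arbitrary fixed seed, reused for every draw.
val rng = new Random(42L)

// Draw 1000 independent samples and record only their sizes.
val sizes = Seq.fill(1000)(bernoulliSample(data, fraction, rng).size)

val meanSize = sizes.sum.toDouble / sizes.size
println(s"smallest sample: ${sizes.min}, largest sample: ${sizes.max}")
println(f"mean sample size: $meanSize%.2f, n * fraction: ${data.size * fraction}%.2f")
```

The mean size should land close to n * fraction = 2, while individual samples range from empty to noticeably larger than 2. The question used withReplacement = true, where the documentation says fraction is the expected number of times each element is chosen; the same reasoning about averages applies there.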

If you want an absolutely fixed sample size, you can use the takeSample method. The downside is that it returns an array (i.e. not an RDD), which must fit in the driver's main memory:

val nu = sc.parallelize(1 to 10)
// set seed for reproducibility
val sp1 = nu.takeSample(true, 2, 182453)
// sp1: Array[Int] = Array(7, 2)

// no seed given, so the two elements differ from call to call
val sp2 = nu.takeSample(true, 2)
// sp2: Array[Int] = Array(2, 10)

val sp3 = nu.takeSample(true, 2)
// sp3: Array[Int] = Array(4, 6)
