Spark - Random Number Generation
Question
I have written a method that must consider a random number to simulate a Bernoulli distribution. I am using random.nextDouble to generate a number between 0 and 1, then making my decision based on that value given my probability parameter.
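In other words, the decision is a simple threshold test on the uniform draw. A minimal sketch (the names below are illustrative, not taken from the actual code):

def bernoulli(rng: scala.util.Random, p: Double): Boolean =
  rng.nextDouble() <= p  // true with probability p, false otherwise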
My problem is that Spark is generating the same random numbers within each iteration of my for loop mapping function. I am using the DataFrame API. My code follows this format:
val myClass = new MyClass()
val M = 3
val myAppSeed = 91234
val rand = new scala.util.Random(myAppSeed)

for (m <- 1 to M) {
  val newDF = sqlContext.createDataFrame(myDF
    .map { row => RowFactory
      .create(row.getString(0),
        myClass.myMethod(row.getString(2), rand.nextDouble()))
    }, myDF.schema)
}
Here is the class:
class myClass extends Serializable {
  val q = qProb

  def myMethod(s: String, rand: Double) = {
    if (rand <= q) // do something
    else // do something else
  }
}
I need a new random number every time myMethod is called. I also tried generating the number inside my method with java.util.Random (scala.util.Random v10 does not extend Serializable) like below, but I'm still getting the same numbers within each for loop:
val r = new java.util.Random(s.hashCode.toLong)
val rand = r.nextDouble()
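A likely reason this variant also repeats: seeding the generator from the record itself makes the draw a deterministic function of that record, so the same string produces the same value on every loop iteration. For example (illustrative only):

val r1 = new java.util.Random("foo".hashCode.toLong).nextDouble()
val r2 = new java.util.Random("foo".hashCode.toLong).nextDouble()
// r1 == r2: identical seeds always yield identical sequences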
I've done some research, and it seems this has to do with Spark's deterministic nature.
Answer
The reason why the same sequence is repeated is that the random generator is created and initialized with a seed before the data is partitioned. Each partition then starts from the same random seed. Maybe not the most efficient way to do it, but the following should work:
val myClass = new MyClass()
val M = 3

for (m <- 1 to M) {
  val newDF = sqlContext.createDataFrame(myDF
    .map {
      val rand = scala.util.Random
      row => RowFactory
        .create(row.getString(0),
          myClass.myMethod(row.getString(2), rand.nextDouble()))
    }, myDF.schema)
}
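A variation on the same idea, working at the RDD level, is to create one generator per partition and derive its seed from the partition index, so no two partitions share a sequence. The following is a sketch under that assumption, using mapPartitionsWithIndex and Row instead of RowFactory; it is not part of the original answer:

import org.apache.spark.sql.Row

val myAppSeed = 91234L
val perPartitionRDD = myDF.rdd.mapPartitionsWithIndex { (idx, rows) =>
  // One generator per partition, seeded differently for each partition index
  val rand = new scala.util.Random(myAppSeed + idx)
  rows.map(row => Row(row.getString(0),
    myClass.myMethod(row.getString(2), rand.nextDouble())))
}
val newDF = sqlContext.createDataFrame(perPartitionRDD, myDF.schema)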