Spark - Random Number Generation


Problem Description

I have written a method that must consider a random number to simulate a Bernoulli distribution. I am using random.nextDouble to generate a number between 0 and 1, then making my decision based on that value, given my probability parameter.

My problem is that Spark generates the same random numbers within each iteration of my for-loop mapping function. I am using the DataFrame API. My code follows this format:

val myClass = new MyClass()
val M = 3
val myAppSeed = 91234
val rand = new scala.util.Random(myAppSeed) // seeded once, on the driver

for (m <- 1 to M) {
  val newDF = sqlContext.createDataFrame(myDF
    .map { row =>
      RowFactory.create(
        row.getString(0),
        myClass.myMethod(row.getString(2), rand.nextDouble()))
    }, myDF.schema)
}

Here is the class:

class MyClass extends Serializable {
  val q = qProb // probability threshold, defined elsewhere

  def myMethod(s: String, rand: Double) = {
    if (rand <= q) // do something
    else // do something else
  }
}

I need a new random number every time myMethod is called. I also tried generating the number inside my method with java.util.Random (scala.util.Random in 2.10 does not extend Serializable), as below, but I still get the same numbers within each for loop:

val r = new java.util.Random(s.hashCode.toLong) // seeding from the string means
val rand = r.nextDouble()                       // the same s always yields the same value

I've done some research, and it seems this has to do with Spark's deterministic nature.

Answer

The same sequence is repeated because the random generator is created and seeded on the driver, before the data is partitioned; each partition then starts from the same seed. This may not be the most efficient way to do it, but the following should work:

val myClass = new MyClass()
val M = 3

for (m <- 1 to M) {
  val newDF = sqlContext.createDataFrame(myDF
    .map { row =>
      // Obtain the generator inside the closure: scala.util.Random here is the
      // singleton object, so each executor JVM initializes its own instance
      // instead of deserializing one pre-seeded copy from the driver.
      val rand = scala.util.Random
      RowFactory.create(
        row.getString(0),
        myClass.myMethod(row.getString(2), rand.nextDouble()))
    }, myDF.schema)
}
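
If reproducible results across runs are also needed, a common variation (not part of the original answer) is to seed one generator per partition from the application seed plus the partition index. Below is a minimal sketch under that assumption, reusing myDF, myClass, myAppSeed, and sqlContext from the question:

import org.apache.spark.sql.RowFactory

val seededRows = myDF.rdd.mapPartitionsWithIndex { (idx, rows) =>
  // One generator per partition, reproducibly seeded: the same seed + partition
  // index on every run, but a different sequence in each partition.
  val rand = new scala.util.Random(myAppSeed + idx)
  rows.map { row =>
    RowFactory.create(
      row.getString(0),
      myClass.myMethod(row.getString(2), rand.nextDouble()))
  }
}
val newDF = sqlContext.createDataFrame(seededRows, myDF.schema)

For a plain Bernoulli draw, Spark SQL's built-in column function org.apache.spark.sql.functions.rand(seed) is another option: it produces a per-row uniform value in [0, 1) and handles per-partition seeding internally.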

