一段R代码会影响foreach输出中的随机数吗? [英] Can piece of R code influence random numbers in foreach output?

查看:75
本文介绍了一段R代码会影响foreach输出中的随机数吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我使用foreachdoParallel进行了仿真,并使用随机数(在代码中命名为random)进行挣扎.

简而言之:我模拟了一个足球联赛,随机产生了所有比赛和相应结果的获胜者.在dt_base中没有比赛,在dt_ex1dt_ex2中已经知道4场比赛的结果.所有未知的结果都应进行模拟.

在这篇文章底部的《联盟模拟代码》中,我设置了1000个模拟,分为100个块(forloop用于将数据发送到PostgreSQL并减少我使用的完整代码中的RAM使用量). 我希望所有随机数都不同(甚至不要坚持要重现结果).

1.在按照给定的代码运行时,应该达到所有不同随机数的目标.

> # ====== Distinct Random Numbers ======
> length(unique(out$random))                              # expectation: 22000
[1] 22000
> length(unique(out$random[out$part == "base"]))          # expectation: 10000
[1] 10000
> length(unique(out$random[out$part == "dt_ex1"]))        # expectation: 6000
[1] 6000
> length(unique(out$random[out$part == "dt_ex2"]))        # expectation: 6000
[1] 6000

2.现在,请取消注释分配最终得分的代码段 *[tmp_sim] = 3(应该是60、61、67、68行上带有!!!的行),然后再次运行.

> # ====== Distinct Random Numbers ======
> length(unique(out$random))                              # expectation: 22000
[1] 10360
> length(unique(out$random[out$part == "base"]))          # expectation: 10000
[1] 10000
> length(unique(out$random[out$part == "dt_ex1"]))        # expectation: 6000
[1] 180
> length(unique(out$random[out$part == "dt_ex2"]))        # expectation: 6000
[1] 180

那是当它弄乱了,对我来说没有意义的时候.当在这些数据帧中添加几个数字时,iter内的random对于dt_ex1dt_ex2始终是相同的.

您是否遇到相同的效果?你知道发生了什么吗?

我尝试了R版本3.5.3和3.6.3.还尝试了doRNG软件包.总是一样的问题.

联盟模拟代码

# League Simulation
rm(list = ls())
set.seed(666)
cat("\014")
library(sqldf)
library(plyr)
library(dplyr)

# ====== User Functions ======
comb4 = function(x, ...) { #function for combining foreach output
  Map(rbind, x, ...)
}

# ====== Data Preparation ======
dt_base = data.frame(id = 1:10,
                  part = rep("base",10),
                  random = NA)

dt_ex1 = data.frame(id = 1:10,
                         part = rep("dt_ex1",10),
                         HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),  # Home Goals
                         AG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),  # Away Goals
                         random = NA)

dt_ex2 = data.frame(id = 1:10,
                            part = rep("dt_ex2",10),
                         HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),  # Home Goals
                         AG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),  # Away Goals
                         random = NA)

# ====== Set Parallel Computing ======
library(foreach)
library(doParallel)

cl = makeCluster(3, outfile = "")
registerDoParallel(cl)

# ====== SIMULATION ======
nsim = 1000                # number of simulations
iterChunk = 100            # split nsim into this many chunks
out = data.frame()    # prepare output DF
for(iter in 1:ceiling(nsim/iterChunk)){
  strt = Sys.time()
  
  out_iter = 
    foreach(i = 1:iterChunk, .combine = comb4, .multicombine = TRUE, .maxcombine = 100000, .inorder = FALSE, .verbose = FALSE,
            .packages = c("plyr", "dplyr", "sqldf")) %dopar% {
              
              ## PART 1
              # simulation number
              id_sim = iterChunk * (iter - 1) + i
              
              # First random numbers set
              dt_base[,"random"] = runif(nrow(dt_base))
              
              
              ## PART 2
              tmp_sim = is.na(dt_ex1$HG) # no results yet
              dt_ex1$random[tmp_sim] = runif(sum(tmp_sim))
              # dt_ex1$HG[tmp_sim] = 3   # !!!
              # dt_ex1$AG[tmp_sim] = 3   # !!!
              
              
              ## PART 3
              tmp_sim = is.na(dt_ex2$HG) # no results yet
              dt_ex2$random[tmp_sim] = runif(sum(tmp_sim))
              # dt_ex2$HG[tmp_sim] = 3   # !!!
              # dt_ex2$AG[tmp_sim] = 3   # !!!
              
              
              # ---- Save Results
              zapasy = rbind.data.frame(dt_base[,c("id","part","random")],
                                        dt_ex1[,c("id","part","random")]
                                        ,dt_ex2[,c("id","part","random")]
              )
              zapasy$id_sim = id_sim
              zapasy$iter = iter
              zapasy$i = i
              
              out_i = list(zapasy = zapasy)
              
              print(Sys.time())
              return(out_i)
            }#i;sim_forcycle
  
  out = rbind.data.frame(out,subset(out_iter$zapasy, !is.na(random)))
  
  fnsh = Sys.time()
  cat(" [",iter,"] ",fnsh - strt, sep = "")
  
}#iter


# ====== Distinct Random Numbers ======
length(unique(out$random))                              # expectation: 22000
length(unique(out$random[out$part == "base"]))          # expectation: 10000
length(unique(out$random[out$part == "dt_ex1"]))        # expectation: 6000
length(unique(out$random[out$part == "dt_ex2"]))        # expectation: 6000


# ====== Stop Parallel Computing ======
stopCluster(cl)

解决方案

R(包括set.seedrunif)使用的随机生成器是全局的,适用于整个应用程序.

您的问题似乎正在发生,因为生成器的访问在并行进程之间共享,但在这些进程之间不同步(也就是说,它不是线程安全的"),因此每个进程都有自己的视图生成器的状态(因此,由于这种不同步的访问,不同的进程可以得出完全相同的随机数).相反,您应该为每个并行进程(在这种情况下为每个模拟)提供自己的随机生成器,该随机生成器不会在进程之间共享,并且要考虑的许多问题之一.数字是您所关心的.


事实证明,根本的问题更多是由于数据帧在进程之间共享而不是R的全局RNG引起的.参见以下问题使用R进行多线程计算:如何得到所有不同的随机数?.

I run a simulation using foreach and doParallel and struggling with random numbers (named random in the code).

In a nutshell: I simulate a football league, randomly generating winners of all the matches and corresponding results. In dt_base no match was played, in dt_ex1 and dt_ex2 results of 4 matches are known already. All unknown results should be simulated.

In the League Simulation Code at the bottom of this post I set 1000 simulations, split into 100 chunks (the forloop is used to send data to PostgreSQL and reduce RAM usage in the full code I use). I expect all the random numbers to be different (don't even insist on reproducible results).

1. When running the code as given, one should achieve the goal of all different random numbers.

> # ====== Distinct Random Numbers ======
> length(unique(out$random))                              # expectation: 22000
[1] 22000
> length(unique(out$random[out$part == "base"]))          # expectation: 10000
[1] 10000
> length(unique(out$random[out$part == "dt_ex1"]))        # expectation: 6000
[1] 6000
> length(unique(out$random[out$part == "dt_ex2"]))        # expectation: 6000
[1] 6000

2. Now please uncomment the pieces of code which assigns the final score *[tmp_sim] = 3 (should be lines 60,61,67,68 with !!! on them) and run it again.

> # ====== Distinct Random Numbers ======
> length(unique(out$random))                              # expectation: 22000
[1] 10360
> length(unique(out$random[out$part == "base"]))          # expectation: 10000
[1] 10000
> length(unique(out$random[out$part == "dt_ex1"]))        # expectation: 6000
[1] 180
> length(unique(out$random[out$part == "dt_ex2"]))        # expectation: 6000
[1] 180

That is when it gets messed up and it doesn't make sense to me. random inside iter is always the same for dt_ex1 and dt_ex2 when adding couple of numbers into these dataframes.

Are you experiencing the same effect? Any idea what is going on please?

I tried R versions 3.5.3 and 3.6.3. Also tried doRNG package. Always the same problem.

League Simulation Code

# League Simulation
rm(list = ls())
set.seed(666)
cat("\014")
library(sqldf)
library(plyr)
library(dplyr)

# ====== User Functions ======
comb4 = function(x, ...) { #function for combining foreach output
  Map(rbind, x, ...)
}

# ====== Data Preparation ======
dt_base = data.frame(id = 1:10,
                  part = rep("base",10),
                  random = NA)

dt_ex1 = data.frame(id = 1:10,
                         part = rep("dt_ex1",10),
                         HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),  # Home Goals
                         AG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),  # Away Goals
                         random = NA)

dt_ex2 = data.frame(id = 1:10,
                            part = rep("dt_ex2",10),
                         HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),  # Home Goals
                         AG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),  # Away Goals
                         random = NA)

# ====== Set Parallel Computing ======
library(foreach)
library(doParallel)

cl = makeCluster(3, outfile = "")
registerDoParallel(cl)

# ====== SIMULATION ======
nsim = 1000                # number of simulations
iterChunk = 100            # split nsim into this many chunks
out = data.frame()    # prepare output DF
for(iter in 1:ceiling(nsim/iterChunk)){
  strt = Sys.time()
  
  out_iter = 
    foreach(i = 1:iterChunk, .combine = comb4, .multicombine = TRUE, .maxcombine = 100000, .inorder = FALSE, .verbose = FALSE,
            .packages = c("plyr", "dplyr", "sqldf")) %dopar% {
              
              ## PART 1
              # simulation number
              id_sim = iterChunk * (iter - 1) + i
              
              # First random numbers set
              dt_base[,"random"] = runif(nrow(dt_base))
              
              
              ## PART 2
              tmp_sim = is.na(dt_ex1$HG) # no results yet
              dt_ex1$random[tmp_sim] = runif(sum(tmp_sim))
              # dt_ex1$HG[tmp_sim] = 3   # !!!
              # dt_ex1$AG[tmp_sim] = 3   # !!!
              
              
              ## PART 3
              tmp_sim = is.na(dt_ex2$HG) # no results yet
              dt_ex2$random[tmp_sim] = runif(sum(tmp_sim))
              # dt_ex2$HG[tmp_sim] = 3   # !!!
              # dt_ex2$AG[tmp_sim] = 3   # !!!
              
              
              # ---- Save Results
              zapasy = rbind.data.frame(dt_base[,c("id","part","random")],
                                        dt_ex1[,c("id","part","random")]
                                        ,dt_ex2[,c("id","part","random")]
              )
              zapasy$id_sim = id_sim
              zapasy$iter = iter
              zapasy$i = i
              
              out_i = list(zapasy = zapasy)
              
              print(Sys.time())
              return(out_i)
            }#i;sim_forcycle
  
  out = rbind.data.frame(out,subset(out_iter$zapasy, !is.na(random)))
  
  fnsh = Sys.time()
  cat(" [",iter,"] ",fnsh - strt, sep = "")
  
}#iter


# ====== Distinct Random Numbers ======
length(unique(out$random))                              # expectation: 22000
length(unique(out$random[out$part == "base"]))          # expectation: 10000
length(unique(out$random[out$part == "dt_ex1"]))        # expectation: 6000
length(unique(out$random[out$part == "dt_ex2"]))        # expectation: 6000


# ====== Stop Parallel Computing ======
stopCluster(cl)

解决方案

The random generator used by R (including by set.seed and runif) is global and applies to the whole application.

It appears that your problem is happening because the generator's access is shared between parallel processes, but is not synchronized between these processes (that is, it's not "thread safe"), so that each process has its own view of the generator's state (so that, as a result, different processes can draw exactly the same random numbers due to this unsynchronized access). Instead, you should give each parallel process (each simulation in this case) its own random generator that's not shared between processes, and seed each process (or simulation) accordingly.

Multithreading is one of the many issues to consider when reproducible "random" numbers are something you care about.


As it turns out, the underlying issue is caused more by data frames being shared among processes, rather than R's global RNG. See this question Multithread computation with R: how to get all different random numbers? .

这篇关于一段R代码会影响foreach输出中的随机数吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆