Can a piece of R code influence random numbers in foreach output?
I run a simulation using foreach and doParallel and am struggling with random numbers (named random in the code).
In a nutshell: I simulate a football league, randomly generating the winners of all matches and the corresponding results. In dt_base no match has been played; in dt_ex1 and dt_ex2 the results of 4 matches are already known. All unknown results should be simulated.
In the League Simulation Code at the bottom of this post I set 1000 simulations, split into chunks of 100 (the for loop is used to send data to PostgreSQL and reduce RAM usage in the full code I use). I expect all the random numbers to be different (I don't even insist on reproducible results).
1. When running the code as given, one should achieve the goal of all different random numbers.
> # ====== Distinct Random Numbers ======
> length(unique(out$random)) # expectation: 22000
[1] 22000
> length(unique(out$random[out$part == "base"])) # expectation: 10000
[1] 10000
> length(unique(out$random[out$part == "dt_ex1"])) # expectation: 6000
[1] 6000
> length(unique(out$random[out$part == "dt_ex2"])) # expectation: 6000
[1] 6000
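The expected counts quoted above follow from the setup. As a minimal sketch of the arithmetic, assuming the values used in the League Simulation Code (1000 simulations, 10 rows in dt_base, and 6 rows with unknown results in each of dt_ex1 and dt_ex2); the variable names here are illustrative only:

```r
nsim <- 1000                                   # number of simulations
HG <- c(1, 3, 6, NA, NA, 2, NA, NA, NA, NA)    # home goals as in dt_ex1 / dt_ex2

n_base <- 10                  # every row of dt_base gets a random number
n_unknown <- sum(is.na(HG))   # only rows without a result get one

expected_base <- nsim * n_base
expected_ex <- nsim * n_unknown
expected_total <- expected_base + 2 * expected_ex

print(c(base = expected_base, dt_ex1 = expected_ex,
        dt_ex2 = expected_ex, total = expected_total))
```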
2. Now please uncomment the pieces of code that assign the final score, *[tmp_sim] = 3 (should be lines 60, 61, 67, 68, the ones marked with !!!), and run it again.
> # ====== Distinct Random Numbers ======
> length(unique(out$random)) # expectation: 22000
[1] 10360
> length(unique(out$random[out$part == "base"])) # expectation: 10000
[1] 10000
> length(unique(out$random[out$part == "dt_ex1"])) # expectation: 6000
[1] 180
> length(unique(out$random[out$part == "dt_ex2"])) # expectation: 6000
[1] 180
That is when it gets messed up, and it doesn't make sense to me. random inside iter is always the same for dt_ex1 and dt_ex2 when adding a couple of numbers into these data frames.
Are you experiencing the same effect? Any idea what is going on please?
I tried R versions 3.5.3 and 3.6.3. I also tried the doRNG package. Always the same problem.
League Simulation Code
# League Simulation
rm(list = ls())
set.seed(666)
cat("\014")
library(sqldf)
library(plyr)
library(dplyr)

# ====== User Functions ======
comb4 = function(x, ...) { # function for combining foreach output
  Map(rbind, x, ...)
}

# ====== Data Preparation ======
dt_base = data.frame(id = 1:10,
                     part = rep("base", 10),
                     random = NA)
dt_ex1 = data.frame(id = 1:10,
                    part = rep("dt_ex1", 10),
                    HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA), # Home Goals
                    AG = c(1,3,6,NA,NA,2,NA,NA,NA,NA), # Away Goals
                    random = NA)
dt_ex2 = data.frame(id = 1:10,
                    part = rep("dt_ex2", 10),
                    HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA), # Home Goals
                    AG = c(1,3,6,NA,NA,2,NA,NA,NA,NA), # Away Goals
                    random = NA)

# ====== Set Parallel Computing ======
library(foreach)
library(doParallel)
cl = makeCluster(3, outfile = "")
registerDoParallel(cl)

# ====== SIMULATION ======
nsim = 1000        # number of simulations
iterChunk = 100    # number of simulations per chunk
out = data.frame() # prepare output DF
for (iter in 1:ceiling(nsim/iterChunk)) {
  strt = Sys.time()
  out_iter =
    foreach(i = 1:iterChunk, .combine = comb4, .multicombine = TRUE,
            .maxcombine = 100000, .inorder = FALSE, .verbose = FALSE,
            .packages = c("plyr", "dplyr", "sqldf")) %dopar% {
      ## PART 1
      # simulation number
      id_sim = iterChunk * (iter - 1) + i
      # First random numbers set
      dt_base[,"random"] = runif(nrow(dt_base))
      ## PART 2
      tmp_sim = is.na(dt_ex1$HG) # no results yet
      dt_ex1$random[tmp_sim] = runif(sum(tmp_sim))
      # dt_ex1$HG[tmp_sim] = 3 # !!!
      # dt_ex1$AG[tmp_sim] = 3 # !!!
      ## PART 3
      tmp_sim = is.na(dt_ex2$HG) # no results yet
      dt_ex2$random[tmp_sim] = runif(sum(tmp_sim))
      # dt_ex2$HG[tmp_sim] = 3 # !!!
      # dt_ex2$AG[tmp_sim] = 3 # !!!
      # ---- Save Results
      zapasy = rbind.data.frame(dt_base[,c("id","part","random")],
                                dt_ex1[,c("id","part","random")],
                                dt_ex2[,c("id","part","random")])
      zapasy$id_sim = id_sim
      zapasy$iter = iter
      zapasy$i = i
      out_i = list(zapasy = zapasy)
      print(Sys.time())
      return(out_i)
    } # i; sim_forcycle
  out = rbind.data.frame(out, subset(out_iter$zapasy, !is.na(random)))
  fnsh = Sys.time()
  cat(" [", iter, "] ", fnsh - strt, sep = "")
} # iter

# ====== Distinct Random Numbers ======
length(unique(out$random)) # expectation: 22000
length(unique(out$random[out$part == "base"])) # expectation: 10000
length(unique(out$random[out$part == "dt_ex1"])) # expectation: 6000
length(unique(out$random[out$part == "dt_ex2"])) # expectation: 6000

# ====== Stop Parallel Computing ======
stopCluster(cl)
The random generator used by R (including by set.seed and runif) is global and applies to the whole application.
It appears that your problem is happening because the generator's access is shared between parallel processes, but is not synchronized between these processes (that is, it's not "thread safe"), so that each process has its own view of the generator's state (so that, as a result, different processes can draw exactly the same random numbers due to this unsynchronized access). Instead, you should give each parallel process (each simulation in this case) its own random generator that's not shared between processes, and seed each process (or simulation) accordingly.
Multithreading is one of the many issues to consider when reproducible "random" numbers are something you care about.
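To illustrate the idea of one generator per process, here is a minimal sketch using the L'Ecuyer-CMRG streams from base R's parallel package. It runs sequentially and only derives the per-worker seeds; on a real cluster, parallel::clusterSetRNGStream(cl, iseed) performs the equivalent setup. The names n_workers, streams and draw_from are illustrative and not part of the original code.

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")   # generator that supports independent streams
set.seed(666)

# Derive one seed per (hypothetical) worker; nextRNGStream() moves to the next stream.
n_workers <- 3
streams <- vector("list", n_workers)
s <- .Random.seed
for (k in seq_len(n_workers)) {
  streams[[k]] <- s
  s <- nextRNGStream(s)
}

# Illustrative helper: install a stream seed globally, then draw from it.
draw_from <- function(seed, n) {
  assign(".Random.seed", seed, envir = globalenv())
  runif(n)
}

draws <- lapply(streams, draw_from, n = 5)
length(unique(unlist(draws)))   # number of distinct values across all streams
```

Separate streams make the draws of different workers independent and reproducible. Note that the asker reports that the doRNG package, which manages seeding for foreach, did not change the outcome here.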
As it turns out, the underlying issue is caused more by data frames being shared among processes than by R's global RNG. See the question "Multithread computation with R: how to get all different random numbers?".
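The effect can be emulated without a cluster. The sketch below rests on an assumption that the post does not confirm: that each worker keeps its exported copy of dt_ex1 for all tasks it runs within one foreach call, so a modification made by one task is seen by the next task on that worker. Under that assumption, once HG has been filled in, is.na(dt_ex1$HG) is FALSE everywhere, no new numbers are drawn, and the old random column is returned again. This would be consistent with the 180 distinct values reported (10 chunks x 3 workers x 6 unknown rows). All function names are illustrative.

```r
set.seed(1)

# Fresh copy of the example data (only the columns needed here).
make_ex <- function() {
  data.frame(id = 1:10,
             HG = c(1, 3, 6, NA, NA, 2, NA, NA, NA, NA),
             random = NA_real_)
}

# Task body as in the question: it modifies the worker's persistent dt_ex1.
task_shared <- function(worker, fill_scores) {
  tmp_sim <- is.na(worker$dt_ex1$HG)
  worker$dt_ex1$random[tmp_sim] <- runif(sum(tmp_sim))
  if (fill_scores) worker$dt_ex1$HG[tmp_sim] <- 3   # the "!!!" line
  worker$dt_ex1$random[!is.na(worker$dt_ex1$random)]
}

# Possible fix: modify a task-local copy and leave the worker's data untouched.
task_local <- function(worker, fill_scores) {
  ex1 <- worker$dt_ex1
  tmp_sim <- is.na(ex1$HG)
  ex1$random[tmp_sim] <- runif(sum(tmp_sim))
  if (fill_scores) ex1$HG[tmp_sim] <- 3
  ex1$random[!is.na(ex1$random)]
}

# Emulate n_workers workers, each running several tasks in one environment.
count_unique <- function(task, fill_scores, n_workers = 3, tasks_per_worker = 5) {
  draws <- numeric(0)
  for (w in seq_len(n_workers)) {
    worker <- new.env()
    worker$dt_ex1 <- make_ex()   # exported once per worker (assumption)
    for (k in seq_len(tasks_per_worker)) {
      draws <- c(draws, task(worker, fill_scores))
    }
  }
  length(unique(draws))
}

shared_off <- count_unique(task_shared, fill_scores = FALSE)  # 3 * 5 * 6 = 90
shared_on  <- count_unique(task_shared, fill_scores = TRUE)   # 3 * 6 = 18
local_on   <- count_unique(task_local,  fill_scores = TRUE)   # 3 * 5 * 6 = 90
print(c(shared_off = shared_off, shared_on = shared_on, local_on = local_on))
```

In the foreach body, the corresponding change would be to copy dt_ex1 and dt_ex2 into differently named variables (or to wrap the body in a function) before modifying them. This is a suggestion derived from the emulation, not something verified against the original setup.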