Multithread computation with R: how to get all different random numbers?

Question

Does anyone know how to make all the random numbers in the following code different, e.g. with the doRNG package? I don't care about reproducibility.

Edit: duplicates that occur by pure chance are acceptable.

rm(list = ls())
set.seed(666)
cat("\014")
library(plyr)
library(dplyr)
library(doRNG)

# ====== Data Preparation ======
dt = data.frame(id = 1:10,
                part = rep("dt",10),
                HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),
                random = NA)

# ====== Set Parallel Computing ======
library(foreach)
library(doParallel)

cl = makeCluster(3, outfile = "")
registerDoParallel(cl)

# ====== SIMULATION ======
nsim = 1000                # number of simulations
iterChunk = 100            # split nsim into this many chunks
out = data.frame()    # prepare output DF
for(iter in 1:ceiling(nsim/iterChunk)){
  strt = Sys.time()
  
  out_iter = 
    foreach(i = 1:iterChunk, .combine = rbind, .multicombine = TRUE, .maxcombine = 100000, .inorder = FALSE, .verbose = FALSE,
            .packages = c("plyr", "dplyr")) %dopar% {
              
              # simulation number
              id_sim = iterChunk * (iter - 1) + i

              ## Generate random numbers
              tmp_sim = is.na(dt$HG) # no results yet
              dt$random[tmp_sim] = runif(sum(tmp_sim))
              dt$HG[tmp_sim] = 3

              # Save Results
              dt$id_sim = id_sim
              dt$iter = iter
              dt$i = i
              
              print(Sys.time())
              return(dt)
            }#i;sim_forcycle
  
  out = rbind.data.frame(out,subset(out_iter, !is.na(random)))
  
  fnsh = Sys.time()
  cat(" [",iter,"] ",fnsh - strt, sep = "")
}#iter

# ====== Stop Parallel Computing ======
stopCluster(cl)

# ====== Distinct Random Numbers ======
length(unique(out$random))              # expectation: 6000

I have been struggling with this for 2 days. I asked this question earlier and received only a general response about random numbers.

Here I would like to ask for a solution (if anybody knows one): how can I set the doRNG package options (or those of a similar package) so that all the random numbers are different across all the loops?

I have tried tons of doRNG settings and I still can't get it to work. I tried R versions 3.5.3 and 3.6.3 on two different computers.

UPDATE following discussion with @Limey

The purpose of the code is to simulate football matches. As the simulation is large, I use iterChunk to "split" the simulation into manageable parts, and after each iter I send the data to a PostgreSQL database so that the simulation doesn't overload RAM. Some matches already have real-world results and have HG (home goals) filled in. I want to simulate the rest.

When iterChunk is set to 1, everything is fine. Increasing iterChunk leads to the same numbers being generated within an iter. For example, I set nsim to 100 and iterChunk to 10 (all matches simulated 100 times, 10 times in each of 10 loops). I expect 600 random numbers (each match simulated independently across all the loops). However, I only get 180, following the logic 3 cores * 6 matches * 10 iterChunks. Using 2 workers I get 120 distinct random numbers (2 * 6 * 10).
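The counts quoted above can be checked with a few lines of arithmetic. This is only a sketch of the bookkeeping; the names n_missing, n_outer and distinct_draws are made up for the illustration and do not appear in the original code.

```r
nsim      <- 100   # total number of simulations
iterChunk <- 10    # simulations per foreach call
n_missing <- 6     # rows of dt with HG = NA (ids 4, 5, 7, 8, 9, 10)
n_outer   <- ceiling(nsim / iterChunk)   # number of outer iterations

# What the simulation should produce: one draw per missing match per simulation
expected <- nsim * n_missing

# What is reported in the question: one set of draws per worker per outer iteration
# (hypothetical helper, for illustration only)
distinct_draws <- function(n_workers) n_workers * n_missing * n_outer

expected            # 600
distinct_draws(3)   # 180
distinct_draws(2)   # 120
```

The observed count scales with the number of workers rather than with iterChunk, which suggests that each worker draws fresh numbers only once per foreach call.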

Furthermore, if I exclude dt$HG[tmp_sim] = 3, all the random numbers are different, whatever the settings.

To understand the problem, I suggest:

  1. Run the code as is (possibly setting nsim to 100 and iterChunk to 10). You will get 180 different random numbers. With lower values of nsim and iterChunk, things may work as expected.
  2. Comment out dt$HG[tmp_sim] = 3. You will get 6000 different random numbers (600 if you change nsim and iterChunk).

The line commented out in the 2nd step assigns the goals scored by the home team. It looks like some kind of bug that I can't get past. Even hearing that someone gets the same result and doesn't know why would be helpful: it would lift the weight of my own stupidity off me.

Thank you, I highly appreciate any effort.

Recommended answer

I realised what the problem with OP's code was whilst I was in the shower. It's simple, and obvious in retrospect: all the loops and parallel processes are working on the same object, the dt data frame. So they keep overwriting each other's changes, and at the end of the outer loop you just have multiple copies of the changes made by the last loop to complete. The solution is equally simple: work on a copy of the dt data frame.
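The effect can be imitated without any parallel backend. The sketch below is a sequential stand-in for one worker handling several tasks, under the assumption (not confirmed in the thread, but consistent with the 3 * 6 * 10 count in the question) that changes to dt made by one task are still visible to the next task on the same worker. The helper names make_dt and simulate_once are hypothetical.

```r
# Hypothetical helper: a reduced version of the data frame from the question
make_dt <- function() {
  data.frame(id = 1:10,
             HG = c(1, 3, 6, NA, NA, 2, NA, NA, NA, NA),
             random = NA_real_)
}

# Hypothetical helper: the body of the foreach loop, reduced to its core
simulate_once <- function(d) {
  todo <- is.na(d$HG)                  # matches without a result yet
  d$random[todo] <- runif(sum(todo))   # draws nothing once HG is filled in
  d$HG[todo] <- 3
  d
}

set.seed(1)
n_tasks <- 10

# (a) Every task keeps working on the same, already modified dt
dt <- make_dt()
shared <- vector("list", n_tasks)
for (i in seq_len(n_tasks)) {
  dt <- simulate_once(dt)
  shared[[i]] <- dt
}
shared <- do.call(rbind, shared)
n_shared <- length(unique(shared$random[!is.na(shared$random)]))

# (b) Every task starts from an untouched copy of the base data frame
base_dt <- make_dt()
fresh <- do.call(rbind, lapply(seq_len(n_tasks), function(i) simulate_once(base_dt)))
n_fresh <- length(unique(fresh$random[!is.na(fresh$random)]))

c(shared = n_shared, fresh = n_fresh)
```

In variant (a) only the first task finds missing HG values, so later tasks return the numbers drawn by the first one. In variant (b) each task sees the six missing rows again and draws its own numbers.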

To minimise the changes, I renamed dt to baseDT:

# ====== Data Preparation ======
baseDT = data.frame(id = 1:10,
                part = rep("dt",10),
                HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),
                random = NA)

and then took a copy of it at the top of the foreach loop:

  out_iter = foreach(i = 1:iterChunk, 
               .combine = rbind, .multicombine = TRUE, .maxcombine = 100000, 
               .inorder = FALSE, .verbose = FALSE,
               .packages = c("plyr", "dplyr")) %dopar% {
    dt <- baseDT   # work on a fresh copy in every task
    # ... rest of the loop body unchanged ...
  }

This gives

> length(unique(out$random))              # expectation: 6000
[1] 6000

as expected.
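For reference, the two fragments above can be assembled into one small script. This is a minimal sketch rather than the exact code from the answer: it assumes the foreach and doParallel packages are installed, uses 2 workers and the smaller nsim = 100 / iterChunk = 10 setting from the question, drops the plyr/dplyr dependency (not needed for this part), and collects the chunks in a list instead of growing out with rbind.

```r
library(foreach)
library(doParallel)

# Base data frame: never modified inside the loop
baseDT <- data.frame(id = 1:10,
                     part = rep("dt", 10),
                     HG = c(1, 3, 6, NA, NA, 2, NA, NA, NA, NA),
                     random = NA_real_)

nsim      <- 100   # number of simulations
iterChunk <- 10    # simulations per foreach call

cl <- makeCluster(2)
registerDoParallel(cl)

chunks <- vector("list", ceiling(nsim / iterChunk))
for (iter in seq_along(chunks)) {
  out_iter <- foreach(i = 1:iterChunk, .combine = rbind, .inorder = FALSE) %dopar% {
    dt <- baseDT                          # fresh copy for every task
    tmp_sim <- is.na(dt$HG)               # matches without a result yet
    dt$random[tmp_sim] <- runif(sum(tmp_sim))
    dt$HG[tmp_sim] <- 3
    dt$id_sim <- iterChunk * (iter - 1) + i
    dt
  }
  # keep only the simulated rows of this chunk
  chunks[[iter]] <- subset(out_iter, !is.na(random))
}

stopCluster(cl)

out <- do.call(rbind, chunks)
nrow(out)                     # nsim * 6 simulated rows
length(unique(out$random))    # should match nrow(out), barring chance duplicates
```

Because baseDT is only read and never assigned to inside the loop body, every task sees the six missing HG values and draws its own numbers.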
