How to create a more efficient simulation loop for Monte Carlo in R

Question

The purpose of this exercise is to create a population distribution of nutrient intake values. The earlier data contained repeated measures; these have been removed, so each row in the data frame is a unique person.

I have this code, which works quite well when tested on a small number of rows from my data frame. For all 7135 rows, it is very slow. I tried to time it, but I killed the run when the elapsed running time on my machine reached 15 hours. The system.time result was: Timing stopped at: 55625.08 2985.39 58673.87.

I would appreciate any comments on speeding up the simulation:

Male.MC <- c()
for (j in 1:100) {
  for (i in 1:nrow(Male.Distrib)) {
    u2        <- Male.Distrib$stddev_u2[i] * rnorm(1, mean = 0, sd = 1)
    mc_bca    <- Male.Distrib$FixedEff[i] + u2
    temp      <- Lambda.Value*mc_bca+1
    ginv_a    <- temp^(1/Lambda.Value)
    d2ginv_a  <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))
    mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
    z <- data.frame(
      RespondentID = Male.Distrib$RespondentID[i],
      Subgroup     = Male.Distrib$Subgroup[i],
      mc_amount    = mc_amount,
      IndvWeight   = Male.Distrib$INDWTS[i]/100
    )

    Male.MC <- as.data.frame(rbind(Male.MC,z))
  }
}

For each of the 7135 observations in my dataset, 100 simulated nutrient values are created and then back-transformed to the original measurement level (the simulation uses the results from a nonlinear mixed effects model fitted to Box-Cox transformed nutrient values).
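
For readers following the arithmetic, the back-transformation in the loop can be read as a second-order Taylor correction applied to the inverse Box-Cox transform. This reading is inferred from the code, not stated in the question, and the symbols are labels introduced here: a stands for mc_bca, λ for Lambda.Value, and σ_e² for Male.Resid.Var.

```latex
\begin{align*}
g(y) &= \frac{y^{\lambda} - 1}{\lambda}, &
g^{-1}(a) &= (\lambda a + 1)^{1/\lambda}, \\
\frac{d^2}{da^2}\, g^{-1}(a) &= (1 - \lambda)\,(\lambda a + 1)^{1/\lambda - 2}, \\
\texttt{mc\_amount} &= g^{-1}(a) + \tfrac{1}{2}\,\sigma_e^{2}\,\max\!\Bigl(0,\ \frac{d^2}{da^2}\, g^{-1}(a)\Bigr).
\end{align*}
```

With λ = 0.4 the exponents 1/λ = 2.5 and 1/λ − 2 = 0.5 are not integers, so both powers are only defined for real numbers when λa + 1 is non-negative; in the code that quantity is temp.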

I would prefer not to use for loops, as I have read that they are inefficient in R, but I do not understand the apply-based options well enough to use them as an alternative. R is being run on stand-alone machines, normally a standard Dell-type desktop running a Windows 7 variant, in case that influences the recommendations for how to change the code.

Update: To reproduce this for testing, use Lambda.Value = 0.4 and Male.Resid.Var = 12.1029420429778; Male.Distrib$stddev_u2 is a constant value over all observations.

str(Male.Distrib)

'data.frame':   7135 obs. of  14 variables:
 $ RndmEff     : num  1.34 -5.86 -3.65 2.7 3.53 ...
 $ RespondentID: num  9966 9967 9970 9972 9974 ...
 $ Subgroup    : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 4 3 2 4 1 4 2 5 1 2 ...
 $ RespondentID: int  9966 9967 9970 9972 9974 9976 9978 9979 9982 9993 ...
 $ Replicates  : num  41067 2322 17434 21723 375 ...
 $ IntakeAmt   : num  33.45 2.53 9.58 43.34 55.66 ...
 $ RACE        : int  2 3 2 2 3 2 2 2 2 1 ...
 $ INDWTS      : num  41067 2322 17434 21723 375 ...
 $ TOTWTS      : num  1.21e+08 1.21e+08 1.21e+08 1.21e+08 1.21e+08 ...
 $ GRPWTS      : num  41657878 22715139 10520535 41657878 10791729 ...
 $ NUMSUBJECTS : int  1466 1100 1424 1466 1061 1466 1424 1252 1061 1424 ...
 $ TOTSUBJECTS : int  7135 7135 7135 7135 7135 7135 7135 7135 7135 7135 ...
 $ FixedEff    : num  6.09 6.76 7.08 6.09 6.18 ...
 $ stddev_u2   : num  2.65 2.65 2.65 2.65 2.65 ...

head(Male.Distrib)

    RndmEff RespondentID Subgroup RespondentID Replicates IntakeAmt RACE INDWTS    TOTWTS   GRPWTS NUMSUBJECTS TOTSUBJECTS  FixedEff stddev_u2
1  1.343753         9966        6         9966      41067 33.449808    2  41067 120622201 41657878        1466        7135  6.089918  2.645938
2 -5.856516         9967        5         9967       2322  2.533528    3   2322 120622201 22715139        1100        7135  6.755664  2.645938
3 -3.648339         9970        4         9970      17434  9.575439    2  17434 120622201 10520535        1424        7135  7.079757  2.645938
4  2.697533         9972        6         9972      21723 43.340180    2  21723 120622201 41657878        1466        7135  6.089918  2.645938
5  3.531878         9974        3         9974        375 55.660607    3    375 120622201 10791729        1061        7135  6.176319  2.645938
6  6.627767         9976        6         9976      48889 91.480049    2  48889 120622201 41657878        1466        7135  6.089918  2.645938

Update 2: the line of the function that is causing the NaN results is

d2ginv_a  <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))

Thanks to everyone for their assistance and comments, and also for the speed of responses.

Update: @Ben Bolker is correct that it is the negative temp values that are causing the NaN issue. I missed this with some testing (after commenting out the function so that only the temp values are returned, and calling my result data frame Test). This code reproduces the NaN issue:

> min(Test)
[1] -2.103819
> min(Test)^(1/Lambda.Value)
[1] NaN

But typing the value in as a literal number and then running the same(?) calculation gives me a result, so I missed this when doing manual calculations:

> -2.103819^(1/Lambda.Value) 
[1] -6.419792
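
The difference between the two results above comes from R's operator precedence, which gives `^` higher priority than unary minus. The sketch below reuses the value and Lambda.Value from the question; the variable names x, literal_form, and variable_form are introduced here purely for illustration.

```r
# Lambda.Value as given in the question's update
Lambda.Value <- 0.4

# The value held in a variable, as min(Test) would return it
x <- -2.103819

# Typed as a literal, R parses this as -(2.103819^(1/Lambda.Value)):
# the power is taken of the positive number and the sign is applied afterwards
literal_form <- -2.103819^(1/Lambda.Value)

# With the value in a variable the base really is negative, and a negative
# base raised to a non-integer power (here 2.5) is NaN
variable_form <- x^(1/Lambda.Value)

print(literal_form)
print(variable_form)

# Parenthesising the literal reproduces the NaN from the stored value
print(is.nan((-2.103819)^(1/Lambda.Value)))
```

So the manual check was not running the same calculation as the code: the loop always works with a stored, genuinely negative temp.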

I now have working code that (I think) uses vectorization, and it is blindingly fast. In case anyone else has this issue, I am posting the working code below. I had to add a minimum to prevent the < 0 issue in the calculation. Thank you to everyone who helped, and to coffee. I did try putting the rnorm results into a data frame, and that really slowed things down; creating them as a vector and then using cbind is really quick. Male.Distrib is my full data frame of 7135 observations, but this code should work on the cut-down version I posted earlier (not tested).

Min_bca <- ((.5*min(Male.AddSugar$IntakeAmt))^Lambda.Value-1)/Lambda.Value
Test <- Male.Distrib[rep(seq.int(1,nrow(Male.Distrib)), 100), 1:ncol(Male.Distrib)]
RnormOutput <- rnorm(nrow(Test),0,1)
Male.Final <- cbind(Test,RnormOutput)
Male.Final$mc_bca    <- Male.Final$FixedEff + (Male.Final$stddev_u2 *     Male.Final$RnormOutput)
Male.Final$temp      <- ifelse(Lambda.Value*Male.Final$mc_bca+1 > Lambda.Value*Min_bca+1,
                           Lambda.Value*Male.Final$mc_bca+1, Lambda.Value*Min_bca+1)
Male.Final$ginv_a    <- Male.Final$temp^(1/Lambda.Value)
Male.Final$d2ginv_a  <- ifelse(0 > (1-Lambda.Value)*Male.Final$temp^(1/Lambda.Value-2),
                           0, (1-Lambda.Value)*Male.Final$temp^(1/Lambda.Value-2))
Male.Final$mc_amount <- Male.Final$ginv_a + Male.Final$d2ginv_a * Male.Resid.Var / 2

Lessons learned for the day:

  • a distribution function does not appear to be resampled in a loop if you try to do what I was trying earlier
  • you can't use max() the way I tried, as it returns the single maximum value from the whole column, whereas I wanted the maximum of two values on each row. An ifelse statement is the replacement to use.
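
The second lesson can be illustrated with a small sketch. The temp values below are made up for the demonstration; pmax() is base R's elementwise ("parallel") maximum and gives the same result as the ifelse construction in a single call.

```r
# Made-up values standing in for one column of simulated results
temp <- c(-2.1, 0.5, 3.0)

# max() pools all of its arguments and returns one number
scalar_max <- max(0, temp)

# pmax() compares element by element, recycling the 0
elementwise <- pmax(0, temp)

# The ifelse() form used in the working code gives the same vector
via_ifelse <- ifelse(temp > 0, temp, 0)

print(scalar_max)
print(elementwise)
print(via_ifelse)
```

Inside the original loop max() happened to work because each call only ever saw two single numbers; once the calculation is vectorized, an elementwise comparison is needed.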

Recommended answer

Here is an approach that addresses the 2 biggest speed issues:

  1. Instead of looping over observations (i), we compute them all at once.
  2. Instead of looping over MC replications (j), we use replicate, which is a simplified apply meant for this purpose.

First we load the dataset and define a function for what you were doing.

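# check.names = FALSE keeps the column names exactly as they appear in the file
Male.Distrib <- read.table('MaleDistrib.txt', check.names = FALSE)

# One Monte Carlo draw for every row of df at once (vectorized over observations)
getMC <- function(df, Lambda.Value = 0.4, Male.Resid.Var = 12.1029420429778) {
  u2        <- df$stddev_u2 * rnorm(nrow(df), mean = 0, sd = 1)
  mc_bca    <- df$FixedEff + u2
  temp      <- Lambda.Value * mc_bca + 1   # negative values here give NaN below
  ginv_a    <- temp^(1 / Lambda.Value)
  # pmax() is the elementwise maximum; max() would collapse the vector to one number
  d2ginv_a  <- pmax(0, (1 - Lambda.Value) * temp^(1 / Lambda.Value - 2))
  mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
  mc_amount
}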

Then we replicate it a bunch of times.

> replicate(10, getMC(Male.Distrib))
         [,1]      [,2]     [,3]     [,4]      [,5]     [,6]     [,7]     [,8]     [,9]    [,10]
[1,] 36.72374 44.491777 55.19637 23.53442 23.260609 49.56022 31.90657 25.26383 25.31197 20.58857
[2,] 29.56115 18.593496 57.84550 22.01581 22.906528 22.15470 29.38923 51.38825 13.45865 21.47531
[3,] 61.27075 10.140378 75.64172 28.10286  9.652907 49.25729 23.82104 31.77349 16.24840 78.02267
[4,] 49.42798 22.326136 33.87446 14.00084 25.107143 25.75241 30.20490 33.14770 62.86563 27.33652
[5,] 53.45546  9.673162 22.66676 38.76392 30.786100 23.42267 28.40211 35.95015 43.75506 58.83676
[6,] 34.72440 23.786004 63.57919  8.08238 12.636745 34.11844 14.88339 21.93766 44.53451 51.12331


Then you can reformat, add IDs, etc., but this is the idea for the main computational part. Good luck!
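
One possible way to do that reformatting is sketched below. It is only an illustration: toy is a three-row stand-in built from the head() output in the question, sims is a random matrix standing in for the result of replicate(), and n_rep and the Replication column are names introduced here. The output columns mirror the data frame built in the question's loop.

```r
set.seed(1)

# Toy stand-in for Male.Distrib (first three rows of the head() output)
toy <- data.frame(
  RespondentID = c(9966, 9967, 9970),
  Subgroup     = c(6, 5, 4),
  INDWTS       = c(41067, 2322, 17434)
)

n_rep <- 4  # hypothetical number of replications; the question uses 100

# Stand-in for replicate(n_rep, getMC(Male.Distrib)):
# one row per respondent, one column per replication
sims <- matrix(runif(nrow(toy) * n_rep, min = 10, max = 60),
               nrow = nrow(toy), ncol = n_rep)

# as.vector() unrolls the matrix column by column, so the identifying
# columns are repeated once per replication to line up with it
Male.MC <- data.frame(
  RespondentID = rep(toy$RespondentID, times = n_rep),
  Subgroup     = rep(toy$Subgroup, times = n_rep),
  Replication  = rep(seq_len(n_rep), each = nrow(toy)),
  mc_amount    = as.vector(sims),
  IndvWeight   = rep(toy$INDWTS / 100, times = n_rep)
)

print(Male.MC)
```

Building the long data frame once, after all the draws are done, avoids the repeated rbind() calls that grow the result one row at a time in the original loop.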
