How to create a more efficient simulation loop for Monte Carlo in R

Views: 47 Published: 2020/5/4 4:52:23 Tags: r loops simulation montecarlo performance

Problem description

The purpose of this exercise is to create a population distribution of nutrient intake values. The earlier data contained repeated measures; these have been removed, so each row in the data frame is a unique person.

I have this code, which works quite well when tested on a small number of rows from my data frame. For all 7135 rows, it is very slow. I tried to time it, but I killed the run when the elapsed time on my machine reached 15 hours. The system.time result was Timing stopped at: 55625.08 2985.39 58673.87.

I would appreciate any suggestions for speeding up the simulation:

Male.MC <- c()
for (j in 1:100) {
  for (i in 1:nrow(Male.Distrib)) {
    u2        <- Male.Distrib$stddev_u2[i] * rnorm(1, mean = 0, sd = 1)
    mc_bca    <- Male.Distrib$FixedEff[i] + u2
    temp      <- Lambda.Value * mc_bca + 1
    ginv_a    <- temp^(1 / Lambda.Value)
    d2ginv_a  <- max(0, (1 - Lambda.Value) * temp^(1 / Lambda.Value - 2))
    mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
    z <- data.frame(
      RespondentID = Male.Distrib$RespondentID[i],
      Subgroup     = Male.Distrib$Subgroup[i],
      mc_amount    = mc_amount,
      IndvWeight   = Male.Distrib$INDWTS[i] / 100
    )

    Male.MC <- as.data.frame(rbind(Male.MC, z))
  }
}

For each of the 7135 observations in my dataset, 100 simulated nutrient values are created and then back-transformed to the original measurement scale (the simulation uses the results from a nonlinear mixed effects model fitted to Box-Cox transformed nutrient values).

I would prefer not to use for loops, as I have read that they are inefficient in R, but I do not understand the apply-based options well enough to use them as an alternative. R is being run on stand-alone machines, normally a standard Dell-type desktop running a Windows 7 variant, in case that influences the recommendations for how to change the code.
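As a rough sketch only (not part of the original post): much of the cost in a loop like the one above probably comes from rbind growing the data frame one row at a time, since each call copies everything accumulated so far. One way around that is to compute a whole replicate with vector arithmetic and bind the replicates once at the end. The data frame toy, the function simulate_once, and the rounded values of lambda and resid_var below are invented for illustration and are not the real data.

```r
# Toy stand-in for Male.Distrib (values invented for illustration only).
set.seed(42)
toy <- data.frame(
  RespondentID = 1:5,
  FixedEff     = c(6.09, 6.76, 7.08, 6.09, 6.18),
  stddev_u2    = 2.65,
  INDWTS       = c(41067, 2322, 17434, 21723, 375)
)
lambda    <- 0.4    # assumed Box-Cox parameter
resid_var <- 12.1   # assumed residual variance
n_rep     <- 100    # replicates per respondent

# One replicate for every row at once: no inner loop over rows.
simulate_once <- function(d) {
  mc_bca   <- d$FixedEff + d$stddev_u2 * rnorm(nrow(d))
  temp     <- lambda * mc_bca + 1
  ginv_a   <- temp^(1 / lambda)
  # pmax() takes the maximum element by element, unlike max().
  d2ginv_a <- pmax(0, (1 - lambda) * temp^(1 / lambda - 2))
  data.frame(
    RespondentID = d$RespondentID,
    mc_amount    = ginv_a + d2ginv_a * resid_var / 2,
    IndvWeight   = d$INDWTS / 100
  )
}

# Collect the replicates in a list and bind them a single time.
reps   <- lapply(seq_len(n_rep), function(j) simulate_once(toy))
result <- do.call(rbind, reps)
```

This sketch does nothing about negative temp values, so a draw that makes temp negative would still give NaN, just as in the loop.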

Update: To reproduce this for testing, use Lambda.Value = 0.4 and Male.Resid.Var = 12.1029420429778; Male.Distrib$stddev_u2 is constant across all observations.

str(Male.Distrib) is

'data.frame':   7135 obs. of  14 variables:
 $ RndmEff     : num  1.34 -5.86 -3.65 2.7 3.53 ...
 $ RespondentID: num  9966 9967 9970 9972 9974 ...
 $ Subgroup    : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 4 3 2 4 1 4 2 5 1 2 ...
 $ RespondentID: int  9966 9967 9970 9972 9974 9976 9978 9979 9982 9993 ...
 $ Replicates  : num  41067 2322 17434 21723 375 ...
 $ IntakeAmt   : num  33.45 2.53 9.58 43.34 55.66 ...
 $ RACE        : int  2 3 2 2 3 2 2 2 2 1 ...
 $ INDWTS      : num  41067 2322 17434 21723 375 ...
 $ TOTWTS      : num  1.21e+08 1.21e+08 1.21e+08 1.21e+08 1.21e+08 ...
 $ GRPWTS      : num  41657878 22715139 10520535 41657878 10791729 ...
 $ NUMSUBJECTS : int  1466 1100 1424 1466 1061 1466 1424 1252 1061 1424 ...
 $ TOTSUBJECTS : int  7135 7135 7135 7135 7135 7135 7135 7135 7135 7135 ...
 $ FixedEff    : num  6.09 6.76 7.08 6.09 6.18 ...
 $ stddev_u2   : num  2.65 2.65 2.65 2.65 2.65 ...

head(Male.Distrib) is

    RndmEff RespondentID Subgroup RespondentID Replicates IntakeAmt RACE INDWTS    TOTWTS   GRPWTS NUMSUBJECTS TOTSUBJECTS  FixedEff stddev_u2
1  1.343753         9966        6         9966      41067 33.449808    2  41067 120622201 41657878        1466        7135  6.089918  2.645938
2 -5.856516         9967        5         9967       2322  2.533528    3   2322 120622201 22715139        1100        7135  6.755664  2.645938
3 -3.648339         9970        4         9970      17434  9.575439    2  17434 120622201 10520535        1424        7135  7.079757  2.645938
4  2.697533         9972        6         9972      21723 43.340180    2  21723 120622201 41657878        1466        7135  6.089918  2.645938
5  3.531878         9974        3         9974        375 55.660607    3    375 120622201 10791729        1061        7135  6.176319  2.645938
6  6.627767         9976        6         9976      48889 91.480049    2  48889 120622201 41657878        1466        7135  6.089918  2.645938

Update 2: The line of the function that is causing the NaN results is

d2ginv_a  <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))

Thanks to everyone for their assistance and comments, and also for the speed of the responses.

Update: @Ben Bolker is correct that the negative temp values are causing the NaN issue. I missed this in my testing (done after commenting out parts of the function so that only the temp values are returned, and calling my result data frame Test). This code reproduces the NaN issue:

> min(Test)
[1] -2.103819
> min(Test)^(1/Lambda.Value)
[1] NaN

But typing the value in as a literal and then running the same(?) calculation gives me a result, which is why I missed this when doing manual calculations:

> -2.103819^(1/Lambda.Value) 
[1] -6.419792
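A likely explanation for the difference, sketched here as an aside: in R the ^ operator binds more tightly than unary minus, so the literal expression is evaluated as -(2.103819^(1/Lambda.Value)) and the base is actually positive. The variable names x, from_variable, from_literal and with_parens are made up for this illustration.

```r
Lambda.Value <- 0.4
x <- -2.103819

# Negative base with a non-integer exponent: NaN.
from_variable <- x^(1 / Lambda.Value)

# Parsed as -(2.103819^(1 / Lambda.Value)): a finite negative number.
from_literal <- -2.103819^(1 / Lambda.Value)

# With the sign inside the parentheses the base is negative again: NaN.
with_parens <- (-2.103819)^(1 / Lambda.Value)
```

So the manual check was not the same calculation as min(Test)^(1/Lambda.Value); wrapping the literal in parentheses reproduces the NaN.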

I now have working code that (I think) uses vectorization, and it is blindingly fast. In case anyone else has this issue, I am posting the working code below. I had to add a minimum to prevent the < 0 issue in the calculation. Thank you to everyone who helped, and to coffee. I did try putting the rnorm results into a data frame, and that really slowed things down; creating them as a vector and then using cbind is really quick. Male.Distrib is my full data frame of 7135 observations, but this code should work on the cut-down version I posted earlier (not tested).

Min_bca <- ((.5*min(Male.AddSugar$IntakeAmt))^Lambda.Value-1)/Lambda.Value
Test <- Male.Distrib[rep(seq.int(1,nrow(Male.Distrib)), 100), 1:ncol(Male.Distrib)]
RnormOutput <- rnorm(nrow(Test),0,1)
Male.Final <- cbind(Test,RnormOutput)
Male.Final$mc_bca    <- Male.Final$FixedEff + (Male.Final$stddev_u2 * Male.Final$RnormOutput)
Male.Final$temp      <- ifelse(Lambda.Value*Male.Final$mc_bca+1 > Lambda.Value*Min_bca+1,
                           Lambda.Value*Male.Final$mc_bca+1, Lambda.Value*Min_bca+1)
Male.Final$ginv_a    <- Male.Final$temp^(1/Lambda.Value)
Male.Final$d2ginv_a  <- ifelse(0 > (1-Lambda.Value)*Male.Final$temp^(1/Lambda.Value-2),
                           0, (1-Lambda.Value)*Male.Final$temp^(1/Lambda.Value-2))
Male.Final$mc_amount <- Male.Final$ginv_a + Male.Final$d2ginv_a * Male.Resid.Var / 2

Lessons learned for the day:

A distribution function does not appear to be resampled in a loop if you try to do what I was trying earlier.
You can't use max() the way I tried, because on a column it returns the single largest value in that column, whereas I wanted the larger of two values for each row. The ifelse statement is the replacement to use.
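For the second point, base R also provides pmax(), which takes the maximum element by element and may read more directly than ifelse. A small sketch with a made-up vector x:

```r
x <- c(-1.5, 0.2, 3)

overall  <- max(0, x)            # 3: one maximum over all values supplied
by_elem  <- pmax(0, x)           # 0.0 0.2 3.0: elementwise maximum
by_ifels <- ifelse(x > 0, x, 0)  # same values as pmax(0, x)
```

Under the same idea, the floor applied to temp in the working code could presumably be written as pmax(Lambda.Value * Male.Final$mc_bca + 1, Lambda.Value * Min_bca + 1); that form has not been run against the real data.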
