Parallel computation of multiple imputation by using mice R package


Question

I want to run 150 multiple imputations by using mice in R. However, in order to save some computing time, I would like to subdivide the process into parallel streams (as suggested by Stef van Buuren in "Flexible Imputation of Missing Data").

My question is: how do I do that?

I can imagine two options:

opt.1:

imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)

and then combine the imputations afterwards by using complete and as.mids

opt.2:

imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)

by adding VAL_1to150 (a different seed value from 1 to 150 for each call), because otherwise it seems to me (I may be wrong) that if they all run with the same dataset and the same seed, you will get the same result 150 times.

Are there any other options?

Thanks

Recommended answer

So the main problem is combining the imputations, and as I see it there are three options: using ibind, using complete as described, or trying to keep the mids structure. I strongly suggest the ibind solution. The others are left in the answer for the curious.

Before doing anything we need to get the parallel mice imputations. The parallel part is rather simple: all we need to do is use the parallel package and make sure that we set the seed using clusterSetRNGStream:

library(parallel)
# mice must be attached here so that clusterExport can find nhanes
library(mice)
# Using all cores can slow down the computer
# significantly, I therefore try to leave one
# core alone in order to be able to do something 
# else during the time the code runs
cores_2_use <- detectCores() - 1

cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
clusterExport(cl, "nhanes")
clusterEvalQ(cl, library(mice))
imp_pars <- 
  parLapply(cl = cl, X = 1:cores_2_use, fun = function(no){
    mice(nhanes, m = 30, printFlag = FALSE)
  })
stopCluster(cl)

The above will yield cores_2_use * 30 imputed datasets.
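
To see why clusterSetRNGStream matters, and why giving every run the same seed (option 1 in the question) is a problem, here is a minimal sketch that uses only the parallel package. The two-worker cluster and the object names same_seed and own_stream are just illustrative choices, not part of the solution above:

```r
library(parallel)

cl <- makeCluster(2)

# Same seed on every worker: both workers draw identical numbers
same_seed <- parLapply(cl, 1:2, function(i) {
  set.seed(1)
  rnorm(3)
})

# clusterSetRNGStream hands each worker its own random number stream
clusterSetRNGStream(cl, 9956)
own_stream <- parLapply(cl, 1:2, function(i) rnorm(3))

stopCluster(cl)

identical(same_seed[[1]], same_seed[[2]])    # TRUE
identical(own_stream[[1]], own_stream[[2]])  # FALSE
```

The same logic applies to mice: identical data plus identical seed gives identical imputations, so each worker needs its own stream.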

As @AleksanderBlekh suggested, mice::ibind is probably the best, most straightforward solution:

imp_merged <- imp_pars[[1]]
for (n in 2:length(imp_pars)){
  imp_merged <- 
    ibind(imp_merged,
          imp_pars[[n]])
}
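
The loop can also be written as a fold. This is only a sketch of the same idea: it assumes that ibind accepts two mids objects built from the same data, as in the loop above, and it uses a small sequential list (imp_list, a name made up for this example) so that it runs without a cluster:

```r
library(mice)

# Three small imputation runs with different seeds (sequential stand-in
# for the parallel results in imp_pars)
imp_list <- lapply(1:3, function(i) {
  mice(nhanes, m = 2, seed = i, printFlag = FALSE)
})

# Fold ibind over the list: ibind(ibind(run1, run2), run3)
imp_folded <- Reduce(ibind, imp_list)

imp_folded$m  # total number of imputations across the three runs
```

With the parallel results this would be Reduce(ibind, imp_pars).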

Using foreach with ibind

Perhaps the simplest alternative is to use foreach:

library(foreach)
library(doParallel)
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
registerDoParallel(cl)

library(mice)
imp_merged <-
  foreach(no = 1:cores_2_use, 
          .combine = ibind, 
          .export = "nhanes",
          .packages = "mice") %dopar%
{
  mice(nhanes, m = 30, printFlag = FALSE)
}
stopCluster(cl)
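
Both versions produce cores_2_use * 30 imputations, which will rarely be exactly the 150 asked for in the question. One way to hit an exact total is to split m unevenly across the workers. The sketch below is plain arithmetic; m_total, n_workers and m_per_worker are names made up for this example, and the worker count of 4 is an assumption:

```r
m_total   <- 150  # imputations wanted in total
n_workers <- 4    # assumed number of workers, e.g. cores_2_use

# Give every worker the rounded-down share ...
m_per_worker <- rep(m_total %/% n_workers, n_workers)

# ... and hand the remainder out one by one to the first workers
extra <- m_total %% n_workers
m_per_worker[seq_len(extra)] <- m_per_worker[seq_len(extra)] + 1

m_per_worker       # 38 38 37 37
sum(m_per_worker)  # 150
```

Inside the worker function the call would then become mice(nhanes, m = m_per_worker[no], printFlag = FALSE).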

Using complete

Extracting the full datasets using complete(..., action="long"), rbind-ing these and then converting the result back with as.mids may work well, but it generates a slimmer object than the other two approaches:

merged_df <- nhanes
merged_df <- 
  cbind(data.frame(.imp = 0,
                   .id = 1:nrow(nhanes)),
        merged_df)
for (n in 1:length(imp_pars)){
  tmp <- complete(imp_pars[[n]], action = "long")
  tmp$.imp <- as.numeric(tmp$.imp) + max(merged_df$.imp)
  merged_df <- 
    rbind(merged_df,
          tmp)
}

imp_merged <- 
  as.mids(merged_df)

# Keep only the estimates (est) and standard errors (se) for easier comparison
cbind(summary(pool(with(data=imp_merged,
                        exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
      summary(pool(with(data=mice(nhanes, 
                                  m = 60, 
                                  printFlag = FALSE),
                        exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])

This gives the output:

                    est         se         est         se
(Intercept) 20.41921496 3.85943925 20.33952967 3.79002725
age         -3.56928102 1.35801557 -3.65568620 1.27603817
hyp          1.63952970 2.05618895  1.60216683 2.17650536
chl          0.05396451 0.02278867  0.05525561 0.02087995
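
A note on the column names: est and se are what the mice version used here returned. As far as I know, newer releases label these columns estimate and std.error instead, so the subsetting above may fail there. A hedged sketch that picks whichever pair is present (the names fit, res and wanted are just for this example):

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)
fit <- with(imp, lm(bmi ~ age + hyp + chl))
res <- summary(pool(fit))

# Use the old or the new column names, depending on the installed version
wanted <- intersect(c("est", "se", "estimate", "std.error"), colnames(res))
res[, wanted]
```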

Keeping a correct mids-object

My alternative approach below shows how to merge imputation objects and retain the full functionality behind the mids object. Given the ibind solution, I've left this in for anyone interested in exploring how to merge complex lists.

I've looked into mice's mids-object, and there are a few steps that you have to take in order to get at least a similar mids-object after running in parallel. If we examine the mids-object and compare two objects with two different setups we get:

library(mice)
imp <- list()
imp <- c(imp,
         list(mice(nhanes, m = 40)))
imp <- c(imp,
         list(mice(nhanes, m = 20)))

sapply(names(imp[[1]]),
       function(n)
         try(all(useful::compare.list(imp[[1]][[n]], 
                                      imp[[2]][[n]]))))

Here you can see that call, m, imp, chainMean, and chainVar differ between the two runs. Of these, imp is without doubt the most important, but it seems wise to update the other components as well. We will therefore start by building a mice merger function:

mergeMice <- function (imp) {
  merged_imp <- NULL
  for (n in 1:length(imp)){
    if (is.null(merged_imp)){
      merged_imp <- imp[[n]]
    }else{
      counter <- merged_imp$m
      # Update counter
      merged_imp$m <- 
        merged_imp$m + imp[[n]]$m
      # Rename chains
      dimnames(imp[[n]]$chainMean)[[3]] <-
        sprintf("Chain %d", (counter + 1):merged_imp$m)
      dimnames(imp[[n]]$chainVar)[[3]] <-
        sprintf("Chain %d", (counter + 1):merged_imp$m)
      # Merge chains
      merged_imp$chainMean <- 
        abind::abind(merged_imp$chainMean, 
                     imp[[n]]$chainMean)
      merged_imp$chainVar <- 
        abind::abind(merged_imp$chainVar, 
                     imp[[n]]$chainVar)
      for (nn in names(merged_imp$imp)){
        # Non-imputed variables are not in the
        # data.frame format but are null
        if (!is.null(imp[[n]]$imp[[nn]])){
          colnames(imp[[n]]$imp[[nn]]) <- 
            (counter + 1):merged_imp$m
          merged_imp$imp[[nn]] <- 
            cbind(merged_imp$imp[[nn]],
                  imp[[n]]$imp[[nn]])
        }
      }
    }
  }
  # TODO: The function should update the $call parameter
  return(merged_imp)
}

We can now simply merge the imputations generated above through:

merged_imp <- mergeMice(imp)
merged_imp_pars <- mergeMice(imp_pars)

Now it seems that we get the right output:

# Compare the three alternatives
cbind(
  summary(pool(with(data=merged_imp,
                    exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
 summary(pool(with(data=merged_imp_pars,
                    exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
 summary(pool(with(data=mice(nhanes, 
                             m = merged_imp$m, 
                             printFlag = FALSE),
                   exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])

This gives:

                    est         se         est        se
(Intercept) 20.16057550 3.74819873 20.31814393 3.7346252
age         -3.67906629 1.19873118 -3.64395716 1.1476377
hyp          1.72637216 2.01171565  1.71063127 1.9936347
chl          0.05590999 0.02350609  0.05476829 0.0213819
                    est         se
(Intercept) 20.14271905 3.60702992
age         -3.78345532 1.21550474
hyp          1.77361005 2.11415290
chl          0.05648672 0.02046868

OK, that's it. Have fun.
