Parallel computation of multiple imputation by using the mice R package


Question

I want to run 150 multiple imputations by using mice in R. However, in order to save some computing time, I would like to subdivide the process into parallel streams (as suggested by Stef van Buuren in "Flexible Imputation of Missing Data").

My question is: how do I do this?

I can imagine two options:

Option 1:

imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)

and then combine the imputations together by using complete and as.mids afterwards.

Option 2:

imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)

by adding VAL_1to150, since otherwise it seems to me (I may be wrong) that if they all run with the same dataset and the same seed you will get 150 times the same result.
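For example, something along these lines (a rough sketch, reusing data and the predictor matrix quicktry from above), where each run gets its own seed:

library(mice)
# Give each of the 150 single-imputation runs its own seed so that
# the streams do not reproduce each other
imps <- lapply(seq_len(150), function(i) {
  mice(data, m = 1, pred = quicktry, maxit = 15, seed = i, printFlag = FALSE)
})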

Are there any other options?

Thanks

Answer

So the main problem is combining the imputations, and as I see it there are three options: using ibind, using complete and as.mids as described in the question, or trying to keep the mids structure. I strongly suggest the ibind solution; the others are left in the answer for the curious.

Before doing anything we need to get the parallel mice imputations. The parallel part is rather simple: all we need to do is use the parallel package and make sure that we set the seed using clusterSetRNGStream:

library(parallel)
# Using all cores can slow down the computer
# significantly, I therefore try to leave one
# core alone in order to be able to do something 
# else during the time the code runs
cores_2_use <- detectCores() - 1

cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
clusterExport(cl, "nhanes")
clusterEvalQ(cl, library(mice))
imp_pars <- 
  parLapply(cl = cl, X = 1:cores_2_use, fun = function(no){
    mice(nhanes, m = 30, printFlag = FALSE)
  })
stopCluster(cl)

The above will yield cores_2_use * 30 imputed datasets.
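As a quick sanity check (a sketch, assuming the cluster code above has run), the result is a plain list holding one mids object per worker:

length(imp_pars)                   # one mids object per core used
sapply(imp_pars, function(x) x$m)  # each element should report m = 30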

As @AleksanderBlekh suggested, mice::ibind is probably the best and most straightforward solution:

imp_merged <- imp_pars[[1]]
for (n in 2:length(imp_pars)){
  imp_merged <- 
    ibind(imp_merged,
          imp_pars[[n]])
}
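If you prefer, the same merge can be written more compactly by folding ibind over the list with Reduce (equivalent to the loop above):

imp_merged <- Reduce(ibind, imp_pars)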

Using foreach and ibind

Perhaps the simplest alternative is to use foreach:

library(foreach)
library(doParallel)
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
registerDoParallel(cl)

library(mice)
imp_merged <-
  foreach(no = 1:cores_2_use, 
          .combine = ibind, 
          .export = "nhanes",
          .packages = "mice") %dopar%
{
  mice(nhanes, m = 30, printFlag = FALSE)
}
stopCluster(cl)
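The merged object can then be used like any other mids object; a small usage sketch (mirroring the model used further down) would be:

imp_merged$m                                   # total number of imputations
fit <- with(imp_merged, lm(bmi ~ age + hyp + chl))
summary(pool(fit))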

Using complete

Extracting the full datasets using complete(..., action="long"), rbind-ing these, and then converting back with as.mids may work well, but it generates a slimmer object than what the other two approaches produce:

merged_df <- nhanes
merged_df <- 
  cbind(data.frame(.imp = 0,
                   .id = 1:nrow(nhanes)),
        merged_df)
for (n in 1:length(imp_pars)){
  tmp <- complete(imp_pars[[n]], action = "long")
  tmp$.imp <- as.numeric(tmp$.imp) + max(merged_df$.imp)
  merged_df <- 
    rbind(merged_df,
          tmp)
}

imp_merged <- 
  as.mids(merged_df)

# Compare est and se (the most important values) for easier comparison
cbind(summary(pool(with(data=imp_merged,
                        exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
      summary(pool(with(data=mice(nhanes, 
                                  m = 60, 
                                  printFlag = FALSE),
                        exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])

Gives the output:

                    est         se         est         se
(Intercept) 20.41921496 3.85943925 20.33952967 3.79002725
age         -3.56928102 1.35801557 -3.65568620 1.27603817
hyp          1.63952970 2.05618895  1.60216683 2.17650536
chl          0.05396451 0.02278867  0.05525561 0.02087995

Keeping a correct mids-object

My alternative approach below shows how to merge imputation objects and retain the full functionality behind the mids object. Since the ibind solution above is simpler, I've left this in for anyone interested in exploring how to merge complex lists.

I've looked into mice's mids-object, and there are a few steps you have to take in order to get at least a similar mids-object after running in parallel. If we examine the mids-object and compare two objects with two different setups, we get:

library(mice)
imp <- list()
imp <- c(imp,
         list(mice(nhanes, m = 40)))
imp <- c(imp,
         list(mice(nhanes, m = 20)))

sapply(names(imp[[1]]),
       function(n)
         try(all(useful::compare.list(imp[[1]][[n]], 
                                      imp[[2]][[n]]))))

Here you can see that the call, m, imp, chainMean, and chainVar differ between the two runs. Of these, imp is without doubt the most important, but it seems like a wise option to update the other components as well. We will therefore start by building a mice merger function:

mergeMice <- function (imp) {
  merged_imp <- NULL
  for (n in 1:length(imp)){
    if (is.null(merged_imp)){
      merged_imp <- imp[[n]]
    }else{
      counter <- merged_imp$m
      # Update counter
      merged_imp$m <- 
        merged_imp$m + imp[[n]]$m
      # Rename chains
      dimnames(imp[[n]]$chainMean)[[3]] <-
        sprintf("Chain %d", (counter + 1):merged_imp$m)
      dimnames(imp[[n]]$chainVar)[[3]] <-
        sprintf("Chain %d", (counter + 1):merged_imp$m)
      # Merge chains
      merged_imp$chainMean <- 
        abind::abind(merged_imp$chainMean, 
                     imp[[n]]$chainMean)
      merged_imp$chainVar <- 
        abind::abind(merged_imp$chainVar, 
                     imp[[n]]$chainVar)
      for (nn in names(merged_imp$imp)){
        # Non-imputed variables are not in the
        # data.frame format but are null
        if (!is.null(imp[[n]]$imp[[nn]])){
          colnames(imp[[n]]$imp[[nn]]) <- 
            (counter + 1):merged_imp$m
          merged_imp$imp[[nn]] <- 
            cbind(merged_imp$imp[[nn]],
                  imp[[n]]$imp[[nn]])
        }
      }
    }
  }
  # TODO: The function should update the $call parameter
  return(merged_imp)
}

We can now simply merge the two imputation sets generated above through:

merged_imp <- mergeMice(imp)
merged_imp_pars <- mergeMice(imp_pars)

Now it seems that we get the right output:

# Compare the three alternatives
cbind(
  summary(pool(with(data=merged_imp,
                    exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
 summary(pool(with(data=merged_imp_pars,
                    exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
 summary(pool(with(data=mice(nhanes, 
                             m = merged_imp$m, 
                             printFlag = FALSE),
                   exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])

Giving:

                    est         se         est        se
(Intercept) 20.16057550 3.74819873 20.31814393 3.7346252
age         -3.67906629 1.19873118 -3.64395716 1.1476377
hyp          1.72637216 2.01171565  1.71063127 1.9936347
chl          0.05590999 0.02350609  0.05476829 0.0213819
                    est         se
(Intercept) 20.14271905 3.60702992
age         -3.78345532 1.21550474
hyp          1.77361005 2.11415290
chl          0.05648672 0.02046868

OK, that's it. Have fun.
