Imputation using mice with clustered data

Question

So I am using the mice package to impute missing data. I'm new to imputation, so I've made some progress but have run into a steep learning curve. To give a toy example:

library(mice)
# Using nhanes dataset as example
df1 <- mice(nhanes, m=10)

So as you can see, I imputed the data 10 times (df1) using mostly default settings, and I am comfortable using this result in regression models, pooling results, etc. However, in my real-life data I have survey data from different countries, so the amount of missing data differs by country, as do the values of specific variables (age, education level, etc.). Therefore I would like to impute the missing values while allowing for clustering by country. So I will create a grouping variable that has no missing values (of course, in this toy example it is uncorrelated with the other variables, but in my real data the correlations exist).

# Create a grouping variable
nhanes$country <- sample(c("A", "B"), size=nrow(nhanes), replace=TRUE)

So how do I tell mice() that this variable is different from the others, i.e. that it is a level in a multi-level dataset?

Recommended answer

If you're thinking of clusters as in "mixed-effects" models, then you should use the methods provided by mice that are intended for clustered data. These methods can be found in the manual and are usually prefixed like 2l.something.

The variety of methods for clustered data is somewhat limited in mice, but I can recommend using 2l.pan for missing data in lower-level units and 2l.only.norm at the cluster level.

As an alternative to mixed-effects models, you may consider using dummy indicators to represent the cluster structure (i.e., one dummy variable for each cluster). This method is not ideal when you think of the clusters from the perspective of mixed-effects models. So if you want to do mixed-effects analyses, then stick to mixed-effects models when you can.
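To make the dummy-indicator idea concrete, here is a minimal sketch in base R. The `country` vector is toy data made up for illustration, not taken from the question. Note that with R's default treatment contrasts, a factor with k levels yields k - 1 indicator columns, because the first level serves as the reference category.

```r
# Toy cluster factor (hypothetical data, for illustration only)
country <- factor(c("A", "B", "C", "A", "C"))

# model.matrix() returns an intercept plus k - 1 treatment-coded
# indicator columns; the first level ("A") is the reference
X <- model.matrix(~ country)

# Drop the intercept column to keep only the dummy indicators
DIs <- X[, -1, drop = FALSE]
DIs
```

Rows belonging to the reference cluster have zeros in every indicator column, so the cluster structure is still fully identified.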

Below, I show an example for both strategies.

Preparation

library(mice)
data(nhanes)

set.seed(123)
nhanes <- within(nhanes,{
  country <- factor(sample(LETTERS[1:10], size=nrow(nhanes), replace=TRUE))
  countryID <- as.numeric(country)
})
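Before imputing, it can help to check how the missing values are spread across clusters, since the question notes that missingness differs by country. The following is a minimal sketch using base R's aggregate(); the names `dat` and `miss_by_country` are introduced here for illustration only, and the random country labels mirror the toy setup above.

```r
library(mice)  # provides the nhanes example data

set.seed(123)
dat <- nhanes
dat$country <- factor(sample(LETTERS[1:10], size = nrow(dat), replace = TRUE))

# Proportion of missing values per variable within each country
miss_by_country <- aggregate(
  is.na(dat[, c("bmi", "hyp", "chl")]),
  by = list(country = dat$country),
  FUN = mean
)
miss_by_country
```

Each row is one country, and each value is the share of missing entries for that variable in that country.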

Case 1: Imputation using mixed-effects models

This section uses 2l.pan to impute the three variables with missing data. Note that I use countryID as the cluster variable by specifying a -2 in the predictor matrix. To all other variables, I assign fixed effects only (1).

# "empty" imputation as a template
imp0 <- mice(nhanes, maxit=0)
pred1 <- imp0$predictorMatrix
meth1 <- imp0$method

# set imputation procedures
meth1[c("bmi","hyp","chl")] <- "2l.pan"

# set predictor Matrix (mixed-effects models with random intercept
# for countryID and fixed effects otherwise)
pred1[,"country"] <- 0     # don't use country factor
pred1[,"countryID"] <- -2  # use countryID as cluster variable
pred1["bmi", c("age","hyp","chl")] <- c(1,1,1)  # fixed effects (bmi)
pred1["hyp", c("age","bmi","chl")] <- c(1,1,1)  # fixed effects (hyp)
pred1["chl", c("age","bmi","hyp")] <- c(1,1,1)  # fixed effects (chl)

# impute
imp1 <- mice(nhanes, maxit=20, m=10, predictorMatrix=pred1, method=meth1)
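To see what the predictor matrix codes look like, the sketch below rebuilds the matrix on a copy of the data and prints the rows for the imputed variables. It only runs the "empty" imputation (maxit=0), so it does not call 2l.pan itself. The names `dat` and `pred` are illustrative and not part of the code above.

```r
library(mice)

set.seed(123)
dat <- nhanes
dat$countryID <- as.integer(factor(sample(LETTERS[1:10], nrow(dat), replace = TRUE)))

# Template predictor matrix from an "empty" run
pred <- mice(dat, maxit = 0)$predictorMatrix

pred[, "countryID"] <- -2  # -2 marks the cluster (class) variable
pred["countryID", ] <- 0   # countryID is complete, so its own row is unused

# Rows are the variables being imputed, columns are their predictors:
# 1 = fixed effect, -2 = cluster variable, 0 = not used
pred[c("bmi", "hyp", "chl"), ]
```

Reading the matrix row by row is a quick way to confirm that every imputation model has exactly one cluster variable and that no variable predicts itself.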

Case 2: Imputation using dummy indicators (DIs) for clusters

This section uses pmm for imputation, and the clustered structure is represented in an "ad hoc" fashion. That is, the clusters aren't represented by random effects but by fixed effects instead. This may exaggerate the cluster-level variability of the variables with missing data, so be sure you know what you are doing when you use it.

# create dummy indicator variables
DIs <- with(nhanes, contrasts(country)[country,])
colnames(DIs) <- paste0("country",colnames(DIs))
nhanes <- cbind(nhanes,DIs)


# "empty" imputation as a template
imp0 <- mice(nhanes, maxit=0)
pred2 <- imp0$predictorMatrix
meth2 <- imp0$method

# set imputation procedures
meth2[c("bmi","hyp","chl")] <- "pmm"

# set predictor matrix (fixed effects only; the clusters enter
# through the dummy indicators)
pred2[,"country"] <- 0     # don't use country factor
pred2[,"countryID"] <- 0   # don't use countryID
pred2[,colnames(DIs)] <- 1 # use dummy indicators
pred2["bmi", c("age","hyp","chl")] <- c(1,1,1)  # fixed effects (bmi)
pred2["hyp", c("age","bmi","chl")] <- c(1,1,1)  # fixed effects (hyp)
pred2["chl", c("age","bmi","hyp")] <- c(1,1,1)  # fixed effects (chl)

# impute
imp2 <- mice(nhanes, maxit=20, m=10, predictorMatrix=pred2, method=meth2)
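Once the imputations exist, the analysis and pooling step mentioned in the question works the same way for either strategy. The sketch below is self-contained, so it uses a plain default imputation rather than imp1 or imp2, and the model `lm(bmi ~ age + chl)` is an arbitrary choice for illustration. The same with() and pool() calls should apply to imp1 or imp2.

```r
library(mice)

# Default imputation, kept small so the example runs quickly
imp <- mice(nhanes, m = 5, maxit = 5, seed = 123, printFlag = FALSE)

# Fit the analysis model in each completed data set
fit <- with(imp, lm(bmi ~ age + chl))

# Combine the estimates across imputations using Rubin's rules
pooled <- pool(fit)
summary(pooled)
```

If the analysis model is itself a mixed-effects model, the function inside with() would be replaced by the corresponding model-fitting function, which is not shown here.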

If you want to read up on what to think of these methods, have a look at one or two of these papers.
