使用R中{boot}的boot()函数在最高级别的群集数据上进行非参数引导 [英] Non-parametric bootstrapping on the highest level of clustered data using boot() function from {boot} in R

查看:413
本文介绍了使用R中{boot}的boot()函数在最高级别的群集数据上进行非参数引导的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两级分层数据,我试图在最高级别上执行非参数引导程序采样,即在保留原始群集内数据的同时,通过替换随机采样最高级别的群集.

I have two-level hierarchical data and I'm trying to perform non-parametric bootstrap sampling on the highest level, i.e., randomly sampling the highest-level clusters with replacement while keeping the original within-cluster data.

我想使用{boot}包中的boot()函数来实现这一点,因为我当时想使用需要引导对象的boot.ci()来建立BCa置信区间.

I want to achieve this using the boot() function in the {boot} package, for the reason that I then would like to build BCa confidence intervals using boot.ci() which requires a boot object.

这是我不走运的尝试-在引导调用上运行调试表明,在集群级别(=主题)没有发生随机采样.

Here follows my unlucky attempt - running a debug on the boot call showed that random sampling is not happening at cluster level (=subject).

### create a very simple two-level dataset with 'subject' as clustering variable

rho <- 0.4
dat <- expand.grid(
    trial=factor(1:5),
    subject=factor(1:3)
    )
sig <- rho * tcrossprod(model.matrix(~ 0 + subject, dat))
diag(sig) <- 1
set.seed(17); dat$value <- chol(sig) %*% rnorm(15, 0, 1)


### my statistic function (adapted from here: http://biostat.mc.vanderbilt.edu/wiki/Main/HowToBootstrapCorrelatedData)

resamp.mean <- function(data, i){
    cluster <- c('subject', 'trial')

    # sample the clustering factor
    cls <- unique(data[[cluster[1]]])[i]   

    # subset on the sampled clustering factors
    sub <- lapply(cls, function(b) subset(data, data[[cluster[1]]]==b))   

    sub.2 <- do.call(rbind, sub)      # join and return samples
    mean((sub.2$value))               # calculate the statistic
}


debugonce(boot)
set.seed(17); dat.boot <- boot(data = dat, statistic = resamp.mean, 4)


### stepping trough the debugger until object 'i' was assigned
### investigating 'i'
# Browse[2]> head(i)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,]    3    7   12   13   10   14   14   15   12    12    12     4     5     9    10
[2,]   15    9    3   13    4   10    2    4    6    11    10     4     9     4     3
[3,]    8    4    7   15   10   12    9    8    9    12     4    15    14    10     4
[4,]   12    3    1   15    8   13    9    1    4    13     9    13     2    11     2

### which is not what I was hoping for.


### I would like something that looks like this, supposing indices = c(2, 2, 1) for the first resample: 

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,]    6    7    8    9   10    6    7    8    9    10     1     2     3     4     5

任何帮助将不胜感激.

Any help would be very much appreciated.

推荐答案

我认为问题出在修改后的统计功能(具体是该功能内的cls对象).你可以试试这个吗?取消注释print语句以查看已采样的主题.它不使用boot期望的index参数,而是仅使用sample,如原始函数中那样.

I think the problem originates from the modified statistic function (specifically, the cls object within the function). Can you try this one? Uncomment the print statement to see which subjects have been sampled. It does not use the index argument which boot expects, instead it just uses sample as in the original function.

resamp.mean <- function(dat, 
                        indices, 
                        cluster = c('subject', 'trial'), 
                        replace = TRUE){
      # boot expects an indices argument but the sampling happens
      # via sample() as in the original source of the function

      # sample the clustering factor
      cls <- sample(unique(dat[[cluster[1]]]), replace=replace)

      # subset on the sampled clustering factors
      sub <- lapply(cls, function(b) subset(dat, dat[[cluster[1]]]==b))

      # join and return samples
      sub <- do.call(rbind, sub)

      # UNCOMMENT HERE TO SEE SAMPLED SUBJECTS 
      # print(sub)

      mean(sub$value)
} 

在计算value的均值之前,先从resamp.mean函数进行一次重新采样,如下所示:

One resample from the resamp.mean function before the mean of value is calculated looks like this:

    trial subject      value
1       1       1 -1.1581291
2       2       1 -0.1458287
3       3       1 -0.2134525
4       4       1 -0.5796521
5       5       1  0.6501587
11      1       3  2.6678441
12      2       3  1.3945740
13      3       3  1.4849435
14      4       3  0.4086737
15      5       3  1.3399146
111     1       1 -1.1581291
121     2       1 -0.1458287
131     3       1 -0.2134525
141     4       1 -0.5796521
151     5       1  0.6501587 

这篇关于使用R中{boot}的boot()函数在最高级别的群集数据上进行非参数引导的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆