Repeated measures bootstrap stats, grouped by multiple factors

Question

I have a data frame that looks like this, but with many more rows:

df <- data.frame(id=c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
                 cond=c('A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'),
                 comm=c('X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y','X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'),
                 measure=c(0.8, 1.1, 0.7, 1.2, 0.9, 2.3, 0.6, 1.1, 0.7, 1.3, 0.6, 1.5, 1.0, 2.1, 0.7, 1.2))

So we have 2 factors (each with 2 levels, thus 4 combinations) and one continuous measure. This is also a repeated measures design: each cell contains multiple values of measure that belong to the same id.

I've tried to solve the grouping problem first, then the bootstrap problem, and then combine the two, but I'm pretty much stuck.

Stats, grouped by 2 factors

I can get multiple summary stats for each of the 4 cells with:

summary_stats <- aggregate(df$measure, 
                           by = list(df$cond, df$comm),
                           function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
print(summary_stats)

which results in:

  Group.1 Group.2     x.mean   x.median       x.sd
1       A       X 0.85000000 0.85000000 0.12909944
2       B       X 0.65000000 0.65000000 0.05773503
3       A       Y 1.70000000 1.70000000 0.58878406
4       B       Y 1.25000000 1.20000000 0.17320508

This is great, as we get multiple stats for each of the 4 cells.

But what I'd really like are 95% bootstrap CIs for each stat in each of the 4 cells. I don't mind if I have to run the final solution once per statistic (e.g. mean, median, etc.), but bonus points for doing it all in one go.

Bootstrap for repeated measures

I can't quite make this work, but what I want are 95% bootstrap CIs computed in a way that is appropriate for this repeated measures design. Unless I'm mistaken, I want to select bootstrap samples on the basis of id (not on the basis of rows of the data frame), then calculate a summary measure (e.g. mean) for each of the 4 cells.

library(boot)
myfunc <- function(data, indices) {
   # select bootstrap sample to index into `id`
   d <- data[data$id==indicies,]
   return(c(mean=mean(d), median=median(d), sd = sd(d)))
}

bresults <- boot(data = CO2$uptake, statistic = myfunc, R = 1000)

Q1: I'm getting errors when selecting the bootstrap sample by id, i.e. on the line d <- data[data$id==indicies,]

Combining the bootstrap and the grouping by 2 factors

Q2: I have no intuition for how to combine the two approaches to get the final desired result. My only idea is to put the aggregate call inside myfunc, so that the cell stats are recalculated for each bootstrap replicate, but I'm out of my comfort zone with R here.

Recommended answer

Your two questions come down to two issues:

  1. How to bootstrap (resample) your data in such a way that you resample based on id, rather than rows
  2. How to perform separate bootstraps for the four groups in your 2x2 design
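
As background for the first issue, it may help to look at what boot actually passes to the statistic function: a vector of resampled row positions, not id values. The sketch below is only an illustration; the three subject ids and the seed are made up, and boot.array(..., indices = TRUE) is used just to display the positions drawn in each replicate.

```r
library(boot)

set.seed(1)                       # arbitrary seed, only for reproducibility
ids <- c(101, 102, 103)           # hypothetical subject ids, not from the question

# boot calls the statistic with the data and a vector of resampled positions
stat <- function(data, indices) mean(data[indices])

b <- boot(data = ids, statistic = stat, R = 5)

# R x n matrix: each row holds the positions drawn for one replicate
idx <- boot.array(b, indices = TRUE)
print(idx)
```

Because the entries are positions in 1..n, comparing them with data$id does not select subjects; the data first has to be arranged so that one row stands for one subject.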

One easy way to do both is with the following packages (dplyr and broom are installed with the tidyverse; boot ships with R as a recommended package):

  • dplyr for manipulating your data (in particular, summarising the data you have for each id), for do(), which runs an operation on each group of a data frame, and for the %>% forward pipe operator, which supplies the result of an expression as the first argument to the next expression so you can chain commands
  • broom for tidy(), which turns the result of each operation into a tidy data frame
  • boot (which you already use) for the bootstrapping

Load the packages:

library(dplyr)
library(broom)
library(boot)

First of all, so that each resample either includes all of a subject's values or none of them, I would store the values each subject has in each cell as a list:

df <- df %>%
    group_by(id, cond, comm) %>%
    summarise(measure=list(measure)) %>%
    ungroup()

Now the data frame has fewer rows (4 per id, one for each cell), and the variable measure is no longer numeric (it is a list column). This means we can use the indices that boot provides directly (solving issue 1), but also that we have to unlist measure before doing calculations with it, so your function now becomes:

myfunc <- function(data, indices) {
    data <- data[indices,]
    return(c(mean=mean(unlist(data$measure)),
             median=median(unlist(data$measure)),
             sd = sd(unlist(data$measure))))
}
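
To check what this function does with a given set of indices, it can be called by hand on a single cell. The following is a sketch under a few assumptions: the cell A / X is rebuilt in base R rather than with dplyr, and the names cell and nested are only illustrative. Each row of nested is one subject, so the index vector c(1, 1) means "subject 1 drawn twice".

```r
# Toy data from the question
df <- data.frame(id = rep(c(1, 2), each = 8),
                 cond = rep(c('A', 'A', 'B', 'B'), 4),
                 comm = rep(c('X', 'Y'), 8),
                 measure = c(0.8, 1.1, 0.7, 1.2, 0.9, 2.3, 0.6, 1.1,
                             0.7, 1.3, 0.6, 1.5, 1.0, 2.1, 0.7, 1.2))

# One cell (A / X), collapsed to one row per id with a list column
cell <- subset(df, cond == 'A' & comm == 'X')
nested <- data.frame(id = sort(unique(cell$id)))
nested$measure <- split(cell$measure, cell$id)   # one vector of values per id

# Same function as in the answer
myfunc <- function(data, indices) {
    data <- data[indices,]
    return(c(mean=mean(unlist(data$measure)),
             median=median(unlist(data$measure)),
             sd = sd(unlist(data$measure))))
}

orig_stats  <- myfunc(nested, c(1, 2))   # both subjects once: the observed stats
twice_stats <- myfunc(nested, c(1, 1))   # subject 1 twice, subject 2 left out
print(orig_stats)
print(twice_stats)
```

Whatever indices are passed, all of a subject's values enter or leave the sample together, which is the point of collapsing to one row per id.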

Now that boot resamples whole rows, each holding all of one subject's values for a cell, we can think about how to do this neatly per group. This is where do() from dplyr and tidy() from broom come in: do() runs an operation for each group in your data frame, and tidy() stores the boot result as a tidy data frame, with one row for each statistic your function produces in each group. So we simply group the data frame again and then call do(tidy(...)), with a . in place of the name of our data frame. This hopefully solves issue 2 for you!

bootresults <- df %>%
    group_by(cond, comm) %>%
    do(tidy(boot(data = ., statistic = myfunc, R = 1000)))

This produces:

# Groups:   cond, comm [4]
     cond   comm   term  statistic         bias    std.error
   <fctr> <fctr>  <chr>      <dbl>        <dbl>        <dbl>
 1      A      X   mean 0.85000000  0.000000000 5.280581e-17
 2      A      X median 0.85000000  0.000000000 5.652979e-17
 3      A      X     sd 0.12909944 -0.004704999 4.042676e-02
 4      A      Y   mean 1.70000000  0.000000000 1.067735e-16
 5      A      Y median 1.70000000  0.000000000 1.072347e-16
 6      A      Y     sd 0.58878406 -0.005074338 7.888294e-02
 7      B      X   mean 0.65000000  0.000000000 0.000000e+00
 8      B      X median 0.65000000  0.000000000 0.000000e+00
 9      B      X     sd 0.05773503  0.000000000 0.000000e+00
10      B      Y   mean 1.25000000  0.001000000 7.283065e-02
11      B      Y median 1.20000000  0.027500000 7.729634e-02
12      B      Y     sd 0.17320508 -0.030022214 5.067446e-02

Hopefully this is what you wanted to see!

If you want to work with the values in bootresults further, you can use other dplyr functions to pick out rows of this table. For example, to look at the bootstrapped standard error of the standard deviation of your measure for condition A / X, you can do the following:

bootresults %>% filter(cond=='A', comm=='X', term=='sd') %>% pull(std.error)
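
The tidy table reports bias and standard error, while the question asks for 95% intervals. One possible way to get them is sketched below, with several assumptions: it uses base R plus boot instead of dplyr and broom, it resamples ids within each cell through a closure over by_id, and it takes plain percentile intervals with quantile() on the replicates (boot.ci() offers other interval types, but it returns NULL when all replicates are identical, which happens for some cells of this toy data). The names cells and ci and the seed are illustrative.

```r
library(boot)

# Toy data from the question
df <- data.frame(id = rep(c(1, 2), each = 8),
                 cond = rep(c('A', 'A', 'B', 'B'), 4),
                 comm = rep(c('X', 'Y'), 8),
                 measure = c(0.8, 1.1, 0.7, 1.2, 0.9, 2.3, 0.6, 1.1,
                             0.7, 1.3, 0.6, 1.5, 1.0, 2.1, 0.7, 1.2))

set.seed(42)                                   # arbitrary seed
cells <- split(df, list(df$cond, df$comm))     # "A.X", "B.X", "A.Y", "B.Y"

ci <- do.call(rbind, lapply(names(cells), function(nm) {
    cell  <- cells[[nm]]
    by_id <- split(cell$measure, cell$id)      # one vector of values per id
    # Statistic: pool the values of the resampled ids, then summarise
    stat <- function(pos, indices) {
        x <- unlist(by_id[indices])
        c(mean = mean(x), median = median(x), sd = sd(x))
    }
    b <- boot(data = seq_along(by_id), statistic = stat, R = 1000)
    # Percentile interval: 2.5% and 97.5% quantiles of the replicates
    q <- apply(b$t, 2, quantile, probs = c(0.025, 0.975))
    data.frame(cell = nm, term = names(b$t0), estimate = unname(b$t0),
               lower = q[1, ], upper = q[2, ], row.names = NULL)
}))
print(ci)
```

With only two ids per cell, as in the toy data, there are very few distinct resamples, so the intervals are coarse and some collapse to a single value; with the full data set they should be more informative.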

Hope this helps!
