How to bootstrap a dataset in R which is blocked by a factor?
Question
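My dataset has two factors, replicate and level, and two variables to be regressed, high.density and low.density. I want to bootstrap this dataset, but the resampling with replacement may only happen within the nested factors replicate and level.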
replicate level high.density low.density
1 low 14 36
1 low 54 31
1 mid 82 10
1 mid 24 NA
2 low 12 28
2 low 11 45
2 mid 12 17
2 mid NA 24
2 up 40 10
2 up NA 5
2 up 20 2
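For example, within replicate/level 1/low, the low.density values 31 and 36 can be swapped (or the high.density values can be swapped), so the head of the dataset might look like this: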
replicate level high.density low.density
1 low 14 31
1 low 54 36
1 mid 82 10
1 mid 24 NA
I then want to estimate a linear regression (GLM) from that dataset. I would appreciate any feedback on how to achieve this.
##DATA FRAME (credits: caldwellst)
df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), level = c("low", "low", "mid", "mid", "low", "low", "mid", "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10,
NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df","tbl_df","tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(cols = list(replicate = structure(list(), class = c("collector_double", "collector")), level = structure(list(), class = c("collector_character","collector")), high.density = structure(list(), class = c("collector_double","collector")), low.density = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 1L), class = "col_spec"))
df$replicate <- as.factor(as.numeric(df$replicate))
df$level <- as.factor(as.character(df$level))
Recommended answer
We can use split and sample within each unique combination of replicate and level. We can repeat this process B times.
df_shuffle <- function(DF) {
  # one data frame per replicate/level combination
  # (the formula form of split() needs R >= 4.1.0)
  my_split <- split(DF, f = ~ DF$replicate + DF$level)
  shuffle <- lapply(my_split, function(x) {
    nrX <- nrow(x)
    # resample each density column with replacement, independently, within the block
    cbind(x[, c('replicate', 'level')],
          high.density = x[sample(seq_len(nrX), replace = TRUE), 'high.density'],
          low.density = x[sample(seq_len(nrX), replace = TRUE), 'low.density'])
  })
  DF_new <- do.call(rbind, shuffle)
  rownames(DF_new) <- NULL
  return(DF_new)
}
B <- 1000L
df_list <- replicate(B, df_shuffle(df), simplify = FALSE)
# ---------------------------------------------------
> df_list[[B]]
replicate level high.density low.density
1 1 low 54 36
2 1 low 54 36
3 2 low 12 45
4 2 low 12 28
5 1 mid 24 10
6 1 mid 82 10
7 2 mid NA 17
8 2 mid 12 17
9 2 up 20 10
10 2 up 40 10
11 2 up 20 5
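As a sanity check on the blocking, the sketch below rebuilds the question's data as a plain data.frame and verifies that one resampled dataset keeps the original block sizes and that every low.density value stays inside its own replicate/level block. The helper name resample_within and the list-based split() call with drop = TRUE are assumptions made for this illustration, not part of the answer.

```r
# The question's data as a plain data.frame (assumption: tibble classes are not needed).
df <- data.frame(
  replicate = factor(c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)),
  level = factor(c("low", "low", "mid", "mid", "low", "low",
                   "mid", "mid", "up", "up", "up")),
  high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20),
  low.density  = c(36, 31, 10, NA, 28, 45, 17, 24, 10, 5, 2)
)

# Hypothetical helper: same idea as df_shuffle(), with list-based grouping.
resample_within <- function(DF) {
  # drop = TRUE skips replicate/level combinations that never occur
  groups <- split(DF, list(DF$replicate, DF$level), drop = TRUE)
  pieces <- lapply(groups, function(g) {
    n <- nrow(g)
    data.frame(
      replicate = g$replicate,
      level = g$level,
      # index-based sampling, each column drawn independently
      high.density = g$high.density[sample.int(n, n, replace = TRUE)],
      low.density = g$low.density[sample.int(n, n, replace = TRUE)]
    )
  })
  res <- do.call(rbind, pieces)
  rownames(res) <- NULL
  res
}

set.seed(1)
boot1 <- resample_within(df)

# Block sizes are unchanged by the resampling.
same_sizes <- all(table(df$replicate, df$level) ==
                    table(boot1$replicate, boot1$level))

# Every resampled low.density value comes from its own block.
ok_values <- all(mapply(
  function(r, l, v) v %in% df$low.density[df$replicate == r & df$level == l],
  as.character(boot1$replicate), as.character(boot1$level), boot1$low.density
))
```

Both same_sizes and ok_values should be TRUE for any seed, since values never cross block boundaries.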
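Because the original data contain missing observations, we must either multiply impute them or delete them listwise. For now, let's go with the latter option.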
# listwise delete missing observations
df_list <- lapply(df_list, function(x) x[complete.cases(x), ])
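To make the effect of listwise deletion concrete, here is a small sketch on the original (unshuffled) values from the question. It counts how many rows survive complete.cases() and, assuming R's default na.action option (na.omit) has not been changed, shows that lm() would drop the same rows on its own. The variable names are chosen for this illustration only.

```r
# The two density columns from the question's data.
df <- data.frame(
  high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20),
  low.density  = c(36, 31, 10, NA, 28, 45, 17, 24, 10, 5, 2)
)

keep <- complete.cases(df)   # TRUE for rows without any NA
n_kept <- sum(keep)          # 8 of the 11 rows remain
dropped <- which(!keep)      # rows 4, 8 and 10 contain an NA

# Explicit listwise deletion versus lm()'s default handling of NAs
# (assumes options("na.action") is still the default, na.omit).
fit_explicit <- lm(high.density ~ low.density, data = df[keep, ])
fit_default  <- lm(high.density ~ low.density, data = df)
same_fit <- isTRUE(all.equal(coef(fit_explicit), coef(fit_default)))
```

Deleting explicitly, as the answer does, keeps the behaviour independent of the session's na.action setting.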
Finally, we run a linear regression on each shuffled dataset and store the B sets of coefficients in out.
row_bind <- function(x) data.frame(do.call(rbind, x))
out <- row_bind(
  lapply(df_list, function(x) lm(high.density ~ low.density, data = x)$coef)
)
## out <- row_bind(
##   lapply(df_list, function(x) glm(replicate ~ low.density, data = x,
##                                   family = binomial())$coef)
## )
# -------------------------------------------------------------------
> dim(out)
[1] 1000 2
Output
> head(out)
X.Intercept. low.density
1 13.74881 0.2804738
2 20.01074 -0.2095672
3 30.26643 -0.2946373
4 29.19541 -0.2752761
5 37.76273 -0.4555651
6 37.72250 -0.1548349
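One common way to summarise such a table of coefficients is a percentile interval for the slope. The answer does not do this, so the sketch below is an assumption about a possible next step. It repeats the blocked resampling in compact form (the helper boot_once and the use of ave() are choices made for this illustration), resampling each column independently within a block as the answer does, and uses a smaller B to keep it quick.

```r
df <- data.frame(
  replicate = factor(c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)),
  level = factor(c("low", "low", "mid", "mid", "low", "low",
                   "mid", "mid", "up", "up", "up")),
  high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20),
  low.density  = c(36, 31, 10, NA, 28, 45, 17, 24, 10, 5, 2)
)

# Hypothetical helper: one blocked resample followed by one regression.
boot_once <- function(DF) {
  rows <- seq_len(nrow(DF))
  # draw row indices with replacement from within a single block
  pick <- function(i) i[sample.int(length(i), replace = TRUE)]
  hi <- ave(rows, DF$replicate, DF$level, FUN = pick)
  lo <- ave(rows, DF$replicate, DF$level, FUN = pick)
  boot <- data.frame(high.density = DF$high.density[hi],
                     low.density = DF$low.density[lo])
  boot <- boot[complete.cases(boot), ]   # listwise deletion
  coef(lm(high.density ~ low.density, data = boot))
}

set.seed(42)
B <- 200L
out <- t(replicate(B, boot_once(df)))    # B x 2 matrix of coefficients

# Percentile interval for the slope (one of several possible bootstrap intervals).
slope_ci <- quantile(out[, "low.density"], probs = c(0.025, 0.975))
```

With only eleven observations the interval will be wide, and it reflects the independent within-block resampling scheme rather than a paired bootstrap of whole rows.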
The code needed to create this image is here.