Fitting several regression models by changing only one independent variable within mutate()
Question
I suspect this question might be a duplicate, but I found nothing satisfactory. Imagine a simple dataset with a structure like this:
set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
cov_b = rbinom(100, 1, prob = 0.5),
cont_a = runif(100),
cont_b = runif(100),
dep = runif(100))
cov_a cov_b cont_a cont_b dep
1 0 1 0.238726027 0.784575267 0.9860542973
2 1 0 0.962358936 0.009429905 0.1370674714
3 0 0 0.601365726 0.779065883 0.9053095817
4 1 1 0.515029727 0.729390652 0.5763018376
5 1 0 0.402573342 0.630131853 0.3954488591
6 0 1 0.880246541 0.480910830 0.4498024841
7 1 1 0.364091865 0.156636851 0.7065019011
8 1 1 0.288239281 0.008215520 0.0825027458
9 1 0 0.170645235 0.452458394 0.3393125802
10 0 0 0.172171746 0.492293329 0.6807875512
What I'm looking for is an elegant dplyr/tidyverse option to fit a separate regression model for every cov_ variable, while including the same set of additional variables and the same dependent variable.
I'm able to solve this problem using this code (requires purrr, dplyr, tidyr and broom):
map(.x = names(df)[grepl("cov_", names(df))],
    ~ df %>%
      nest(data = everything()) %>%  # tidyr >= 1.0 warns when nest() is called without arguments
      mutate(res = map(data, function(y) tidy(lm(dep ~ cont_a + cont_b + !!sym(.x), data = y)))) %>%
      unnest(res))
[[1]]
# A tibble: 4 x 6
data term estimate std.error statistic p.value
<list> <chr> <dbl> <dbl> <dbl> <dbl>
1 <tibble [100 × 5]> (Intercept) 0.472 0.0812 5.81 0.0000000799
2 <tibble [100 × 5]> cont_a -0.103 0.0983 -1.05 0.296
3 <tibble [100 × 5]> cont_b 0.172 0.0990 1.74 0.0848
4 <tibble [100 × 5]> cov_a -0.0455 0.0581 -0.783 0.436
[[2]]
# A tibble: 4 x 6
data term estimate std.error statistic p.value
<list> <chr> <dbl> <dbl> <dbl> <dbl>
1 <tibble [100 × 5]> (Intercept) 0.415 0.0787 5.27 0.000000846
2 <tibble [100 × 5]> cont_a -0.0874 0.0984 -0.888 0.377
3 <tibble [100 × 5]> cont_b 0.181 0.0980 1.84 0.0682
4 <tibble [100 × 5]> cov_b 0.0482 0.0576 0.837 0.405
However, I would like to avoid the double map() and solve this with a more direct or elegant approach.
Recommended answer
I'm not sure if this will be considered more direct/elegant, but here is my solution that does not use a double map:
library(tidyverse)
library(broom)
gen_model_expr <- function(var) {
form = paste("dep ~ cont_a + cont_b +", var)
tidy(lm(as.formula(form), data = df))
}
grep("cov_", names(df), value = TRUE) %>%
map(gen_model_expr)
Output (note that it does not retain the data column):
[[1]]
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.472 0.0812 5.81 0.0000000799
2 cont_a -0.103 0.0983 -1.05 0.296
3 cont_b 0.172 0.0990 1.74 0.0848
4 cov_a -0.0455 0.0581 -0.783 0.436
[[2]]
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.415 0.0787 5.27 0.000000846
2 cont_a -0.0874 0.0984 -0.888 0.377
3 cont_b 0.181 0.0980 1.84 0.0682
4 cov_b 0.0482 0.0576 0.837 0.405
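Since the question asks for something that lives inside mutate(), the same idea can also be sketched as a single pipeline that returns one tibble instead of a list. This is only a sketch, not part of the original answer: fit_one is a hypothetical helper name, reformulate() from base R is swapped in for the paste()/as.formula() step, and the pattern is anchored as "^cov_" on the assumption that the covariates are identified by that prefix.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(broom)

# Same simulated data as in the question
set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
                 cov_b = rbinom(100, 1, prob = 0.5),
                 cont_a = runif(100),
                 cont_b = runif(100),
                 dep = runif(100))

# Hypothetical helper: fit dep ~ cont_a + cont_b + <var> and tidy the coefficients
fit_one <- function(var, data) {
  form <- reformulate(c("cont_a", "cont_b", var), response = "dep")
  tidy(lm(form, data = data))
}

# One row per covariate, one model per row, then flatten the tidy output
results <- tibble(covariate = grep("^cov_", names(df), value = TRUE)) %>%
  mutate(res = map(covariate, fit_one, data = df)) %>%
  unnest(res)

results
```

The covariate column records which model each coefficient row belongs to, so the result can be filtered or plotted without tracking list positions.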
Edit
In the interest of speed (inspired by @TimTeaFan), the benchmark below compares different ways to retrieve the covariate names. grep("cov_", names(df), value = TRUE) is the fastest:
# A tibble: 4 x 13
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:> <bch:> <dbl> <bch:byt>
1 names(df)[grepl("cov_", names(df))] 7.59µs 8.4µs 101975. 0B
2 grep("cov_", colnames(df), value = TRUE) 8.21µs 8.96µs 103142. 0B
3 grep("cov_", names(df), value = TRUE) 6.96µs 7.43µs 128694. 0B
4 df %>% select(starts_with("cov_")) %>% colnames 1.17ms 1.33ms 636. 5.39KB
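The code that produced this table is not shown. The column types suggest it came from the bench package, so a comparison of this kind could be reproduced roughly as follows. This is a sketch that assumes bench is installed; timings will differ from machine to machine.

```r
library(dplyr)

# Same simulated data as in the question
set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
                 cov_b = rbinom(100, 1, prob = 0.5),
                 cont_a = runif(100),
                 cont_b = runif(100),
                 dep = runif(100))

# bench::mark() times each expression and, by default, also checks
# that all of them return the same value
timings <- bench::mark(
  names(df)[grepl("cov_", names(df))],
  grep("cov_", colnames(df), value = TRUE),
  grep("cov_", names(df), value = TRUE),
  df %>% select(starts_with("cov_")) %>% colnames()
)

timings
```

The first three expressions differ by only a microsecond or so, so the practical gap is between the base R lookups and the select() pipeline.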