Fitting several regression models by changing only one independent variable within mutate()


Question

I suspect this question may be a duplicate, but I found nothing satisfactory. Imagine a simple dataset with a structure like this (the first 10 of 100 rows are shown below the code):

set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
                 cov_b = rbinom(100, 1, prob = 0.5),
                 cont_a  = runif(100),
                 cont_b = runif(100),
                 dep = runif(100))

    cov_a cov_b      cont_a      cont_b          dep
1       0     1 0.238726027 0.784575267 0.9860542973
2       1     0 0.962358936 0.009429905 0.1370674714
3       0     0 0.601365726 0.779065883 0.9053095817
4       1     1 0.515029727 0.729390652 0.5763018376
5       1     0 0.402573342 0.630131853 0.3954488591
6       0     1 0.880246541 0.480910830 0.4498024841
7       1     1 0.364091865 0.156636851 0.7065019011
8       1     1 0.288239281 0.008215520 0.0825027458
9       1     0 0.170645235 0.452458394 0.3393125802
10      0     0 0.172171746 0.492293329 0.6807875512

What I'm looking for is an elegant dplyr/tidyverse way to fit a separate regression model for every cov_ variable, while keeping the same set of additional variables and the same dependent variable in each model.

I can solve this problem with the following code (requires purrr, dplyr, tidyr and broom):

map(.x = names(df)[grepl("cov_", names(df))],
    ~ df %>%
      # nest the whole data frame into a single list-column named data
      nest(data = everything()) %>%
      mutate(res = map(data, function(y) tidy(lm(dep ~ cont_a + cont_b + !!sym(.x), data = y)))) %>%
      unnest(res))

[[1]]
# A tibble: 4 x 6
  data               term        estimate std.error statistic      p.value
  <list>             <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 <tibble [100 × 5]> (Intercept)   0.472     0.0812     5.81  0.0000000799
2 <tibble [100 × 5]> cont_a       -0.103     0.0983    -1.05  0.296       
3 <tibble [100 × 5]> cont_b        0.172     0.0990     1.74  0.0848      
4 <tibble [100 × 5]> cov_a        -0.0455    0.0581    -0.783 0.436       

[[2]]
# A tibble: 4 x 6
  data               term        estimate std.error statistic     p.value
  <list>             <chr>          <dbl>     <dbl>     <dbl>       <dbl>
1 <tibble [100 × 5]> (Intercept)   0.415     0.0787     5.27  0.000000846
2 <tibble [100 × 5]> cont_a       -0.0874    0.0984    -0.888 0.377      
3 <tibble [100 × 5]> cont_b        0.181     0.0980     1.84  0.0682     
4 <tibble [100 × 5]> cov_b         0.0482    0.0576     0.837 0.405 

However, I would like to avoid the double map() and solve this with a more direct or elegant approach.

Recommended answer

I'm not sure whether this counts as more direct or elegant, but here is my solution, which does not use a double map:

library(tidyverse)
library(broom)

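# Build the formula as a string for the given covariate, fit the model, and tidy the result
gen_model_expr <- function(var) {
  form <- paste("dep ~ cont_a + cont_b +", var)
  tidy(lm(as.formula(form), data = df))  # df is taken from the global environment
}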

grep("cov_", names(df), value = TRUE) %>%
  map(gen_model_expr)

Output (note that the data column is not retained):

[[1]]
# A tibble: 4 x 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)   0.472     0.0812     5.81  0.0000000799
2 cont_a       -0.103     0.0983    -1.05  0.296       
3 cont_b        0.172     0.0990     1.74  0.0848      
4 cov_a        -0.0455    0.0581    -0.783 0.436       

[[2]]
# A tibble: 4 x 5
  term        estimate std.error statistic     p.value
  <chr>          <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)   0.415     0.0787     5.27  0.000000846
2 cont_a       -0.0874    0.0984    -0.888 0.377      
3 cont_b        0.181     0.0980     1.84  0.0682     
4 cov_b         0.0482    0.0576     0.837 0.405 
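If a single data frame is preferred over a list of tibbles, the same idea can be kept inside mutate() with only one map(). The following is a minimal sketch, not part of the original answer: the helper name fit_one and the column name covariate are hypothetical, and the formula is built with stats::reformulate() as an alternative to paste() plus as.formula().

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)
library(broom)

# Same simulated data as in the question
set.seed(123)
df <- data.frame(cov_a  = rbinom(100, 1, prob = 0.5),
                 cov_b  = rbinom(100, 1, prob = 0.5),
                 cont_a = runif(100),
                 cont_b = runif(100),
                 dep    = runif(100))

# Hypothetical helper: takes the data explicitly instead of relying on a
# global df, and builds dep ~ cont_a + cont_b + <var> with reformulate().
fit_one <- function(var, data) {
  form <- reformulate(c("cont_a", "cont_b", var), response = "dep")
  tidy(lm(form, data = data))
}

covs <- grep("cov_", names(df), value = TRUE)

# One row per covariate, one map() inside mutate(), then unnest the
# tidied coefficients so each row is a (covariate, term) pair.
models <- tibble(covariate = covs) %>%
  mutate(res = map(covariate, fit_one, data = df)) %>%
  unnest(res)

models
```

The covariate column records which model each coefficient row came from, so the result can be filtered or compared across models without indexing into a list. Whether this is more elegant than the list version is a matter of taste.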

Edit

In the interest of speed (inspired by @TimTeaFan), the benchmark below compares different ways to retrieve the covariate names. grep("cov_", names(df), value = TRUE) is the fastest, and the three base R variants are all far faster than the select() pipeline:

# A tibble: 4 x 13
  expression                                         min median `itr/sec` mem_alloc
  <bch:expr>                                      <bch:> <bch:>     <dbl> <bch:byt>
1 names(df)[grepl("cov_", names(df))]             7.59µs  8.4µs   101975.        0B
2 grep("cov_", colnames(df), value = TRUE)        8.21µs 8.96µs   103142.        0B
3 grep("cov_", names(df), value = TRUE)           6.96µs 7.43µs   128694.        0B
4 df %>% select(starts_with("cov_")) %>% colnames 1.17ms 1.33ms      636.    5.39KB
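The answer does not show the code behind this table. The `<bch:expr>` column types suggest it came from the bench package, so a comparison along these lines could be reproduced roughly as follows. This is a sketch under that assumption; absolute timings depend on the machine and package versions and will not match the table exactly.

```r
library(dplyr)
library(bench)  # assumption: the table above was produced with bench::mark()

# Same simulated data as in the question
set.seed(123)
df <- data.frame(cov_a  = rbinom(100, 1, prob = 0.5),
                 cov_b  = rbinom(100, 1, prob = 0.5),
                 cont_a = runif(100),
                 cont_b = runif(100),
                 dep    = runif(100))

# bench::mark() runs each expression repeatedly and, by default,
# also checks that all expressions return equal results.
bm <- bench::mark(
  names(df)[grepl("cov_", names(df))],
  grep("cov_", colnames(df), value = TRUE),
  grep("cov_", names(df), value = TRUE),
  df %>% select(starts_with("cov_")) %>% colnames()
)

# Show only the columns that appear in the table above
bm[, c("expression", "min", "median", "itr/sec", "mem_alloc")]
```

For a data frame with only a handful of columns, the differences between the three base R variants are on the order of a microsecond, so readability is a reasonable tiebreaker.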
