Running multiple linear regressions across several columns of a data frame in R


Question

I have a dataset with a g column, a pair column, and a series of value columns (V1, V2, and so on).

I would like to run linear regression models and ANOVA using V1, V2, etc. as the independent variables and the g column as the dependent variable in each case (i.e. lm(V1 ~ g), lm(V2 ~ g), and so forth). This would be straightforward, except that the regressions need to be grouped by level of the pair column, so that, for example, my output contains lm(V1 ~ g) for all rows with pair 1.1, lm(V1 ~ g) for all rows with pair 1.201, etc.

I've tried a number of approaches using for loops, lapply and the data.table package, and nothing gives me exactly the output I'd like. Can anyone give me any insight on the best way to tackle this problem?

Edit:
My full data set has 7056 different pairs in the pair column and 100 V columns (V1...V100). My latest attempt at this problem:

df$pair <- as.factor(df$pair)
out <- list()
for (i in 3:ncol(df)) {
  out[[i]] <- lapply(levels(df$pair), function(x) {
    data.frame(
      df = x,
      g = coef(summary(lm(df[, i] ~ df$g, data = df[df$pair == x, ])), row.names = NULL)
    )
  })
}


Recommended answer

Let's get some tidyverse power here, along with broom, and forgo all these loops...

First I'll make a dummy table:

df <- data.frame(
  g = runif(50), 
  pair = sample(x = c("A", "B", "C"), size = 50, replace = TRUE), 
  V1 = runif(50), 
  V2 = runif(50), 
  V3 = runif(50), 
  V4 = runif(50), 
  V5 = runif(50),
  stringsAsFactors = FALSE
)

That's approximately what your data structure looks like. Now onto the meat of the code:

library(tidyverse)
library(broom)

df %>% 
  as_tibble %>% 
  gather(key = "column", value = "value", V1:V5) %>%       # first set the data in long format
  nest(g, value) %>%                                       # now nest the dependent and independent factors
  mutate(model = map(data, ~lm(g ~ value, data = .))) %>%  # fit the model using purrr
  mutate(tidy_model = map(model, tidy)) %>%                # clean the model output with broom
  select(-data, -model) %>%                                # remove the "untidy" parts
  unnest()                                                 # get it back in a recognizable data frame

Which gives the following:

# A tibble: 30 x 7
   pair  column term        estimate std.error statistic  p.value
   <chr> <chr>  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
 1 C     V1     (Intercept)  0.470       0.142    3.31   0.00561 
 2 C     V1     value        0.125       0.265    0.472  0.645   
 3 B     V1     (Intercept)  0.489       0.142    3.45   0.00359 
 4 B     V1     value       -0.0438      0.289   -0.151  0.882   
 5 A     V1     (Intercept)  0.515       0.111    4.63   0.000279
 6 A     V1     value       -0.00569     0.249   -0.0229 0.982   
 7 C     V2     (Intercept)  0.367       0.147    2.50   0.0265  
 8 C     V2     value        0.377       0.300    1.26   0.231   
 9 B     V2     (Intercept)  0.462       0.179    2.59   0.0206  
10 B     V2     value        0.0175      0.322    0.0545 0.957   
# … with 20 more rows

Yep, that's a beaut! Note that I used lm(g ~ value) instead of lm(value ~ g), as this is what your text description alluded to.
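
The pipeline above uses an older tidyr interface: gather() has since been superseded by pivot_longer(), and from tidyr 1.0.0 on, nest() expects a named argument and unnest() expects cols. The following is a minimal sketch of the same idea with the newer calls, not the answer's original code. It also adds a per-group ANOVA table, since the question asked for one. The names models, coef_table and anova_table, the set.seed() call, and the use of three V columns are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1)  # assumption: fixed seed only so the sketch is reproducible

# Dummy data in the same shape as the answer's table (fewer V columns)
df <- data.frame(
  g = runif(50),
  pair = sample(c("A", "B", "C"), size = 50, replace = TRUE),
  V1 = runif(50),
  V2 = runif(50),
  V3 = runif(50),
  stringsAsFactors = FALSE
)

# One row per (pair, column) combination, each holding its own fitted model
models <- df %>%
  as_tibble() %>%
  pivot_longer(cols = V1:V3, names_to = "column", values_to = "value") %>%
  nest(data = c(g, value)) %>%
  mutate(model = map(data, ~ lm(g ~ value, data = .x)))

# Coefficient table: two rows (intercept and slope) per model
coef_table <- models %>%
  mutate(tidy_model = map(model, tidy)) %>%
  select(pair, column, tidy_model) %>%
  unnest(cols = tidy_model)

# ANOVA table: two rows (value and Residuals) per model
anova_table <- models %>%
  mutate(tidy_anova = map(model, ~ tidy(anova(.x)))) %>%
  select(pair, column, tidy_anova) %>%
  unnest(cols = tidy_anova)

print(coef_table)
print(anova_table)
```

To match the lm(V1 ~ g) direction written in the question's code, swap the formula to lm(value ~ g, data = .x); the rest of the pipeline stays the same. With 7056 pairs and 100 columns this fits 705,600 models, so it may be worth trying the pipeline on a subset of pairs first.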
