Regression of a Data Frame with multiple factor groupings


Problem description

I am working on a regression script. I have a data.frame with roughly 130 columns, of which I need to do a regression for one column (let's call it the X column) against all the other ~100 numeric columns.

Before the regression is calculated, I need to group the data by 4 factors: myDat$Recipe, myDat$Step, myDat$Stage, and myDat$Prod while still keeping the other ~100 columns and row data attached for the regression. Then I need to do a regression of each column ~ X column and print out the R^2 value with the column name. This is what I've tried so far but it is getting overly complicated and I know there's got to be a better way.

rm(list=ls())
myDat <- read.csv(file="C:/Users/Documents/myDat.csv", header=TRUE, sep=",")

for(j in myDat$Recipe)
{
  myDatj <- subset(myDat, myDat$Recipe == j)
  for(k in myDatj$Step)
  {
    myDatk <- subset(myDatj, myDatj$Step == k)
    for(i in myDatk$Stage)
    {
      myDati <- subset(myDatk, myDatk$Stage == i)
      for(m in myDati$Prod)
      {
        myDatm <- subset(myDati, myDati$Prod == m)
        if(is.numeric(myDatm[3,i]))
        {
          fit <- lm(myDatk[,i] ~ X, data=myDatm)
          rsq <- summary(fit)$r.squared
          {
            writeLines(paste(rsq,i,"\n"))
          }
        }
      }
    }
  }
}

Solution

You can do this by combining dplyr, tidyr and my broom package (you can install them with install.packages). First you need to gather all the numeric columns into a single column:

library(dplyr)
library(tidyr)
tidied <- myDat %>%
    gather(column, value, -X, -Recipe, -Step, -Stage, -Prod)

To understand what this does, you can read up on tidyr's gather operation. (This assumes that all columns besides X, Recipe, Step, Stage, and Prod are numeric and therefore should be predicted in your regression. If that's not the case, you need to remove them beforehand. You'll need to produce a reproducible example of the problem if you need a more customized solution).
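To make the reshaping step concrete, here is a minimal sketch on a small made-up data frame. The data frame `toy` and its columns `A` and `B` are hypothetical stand-ins for `myDat` and its ~100 numeric columns; only the call to `gather` mirrors the answer above.

```r
library(tidyr)

# Hypothetical toy data: X is the predictor, A and B stand in for
# the many numeric columns of myDat
toy <- data.frame(
  Recipe = c("r1", "r1", "r2", "r2"),
  Step   = c("s1", "s1", "s1", "s1"),
  Stage  = c("a", "a", "a", "a"),
  Prod   = c("p1", "p1", "p1", "p1"),
  X      = c(1, 2, 3, 4),
  A      = c(2.1, 3.9, 6.2, 8.1),
  B      = c(5, 3, 8, 1)
)

# Stack every column except X and the four grouping factors into
# a (column, value) pair of columns
toy_long <- gather(toy, column, value, -X, -Recipe, -Step, -Stage, -Prod)

print(toy_long)
```

Each original row now appears once per gathered column, with the column name stored in `column` and its entry in `value`, while `X` and the grouping factors are repeated alongside. In newer tidyr versions, `pivot_longer()` is the recommended replacement for `gather()` and could be used here instead.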

Then perform each regression, while grouping by the column and the four grouping variables.

library(broom)

regressions <- tidied %>%
    group_by(column, Recipe, Step, Stage, Prod) %>%
    do(mod = lm(value ~ X, data = .))

glances <- regressions %>% glance(mod)

The resulting glances data frame will have one row for each combination of column, Recipe, Step, Stage, and Prod, along with an r.squared column containing the R-squared from each model. (It will also contain adj.r.squared, along with other columns such as F-test p-value: see here for more). Running coefs <- regressions %>% tidy(mod) will probably also be useful for you, as it will get the coefficient estimates and p-values from each regression.
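As an illustration of what the per-group summaries look like, the following is a sketch on simulated data rather than the original `myDat`. It uses `group_modify()` from dplyr in place of `do()`, which is a swapped-in variant and not the code from the answer; newer broom releases may no longer accept `glance(mod)` on the rowwise result of `do()`, so this form may be easier to run today. The data frame `toy_long` and its values are hypothetical.

```r
library(dplyr)
library(broom)

set.seed(1)

# Hypothetical long-format data with the same column names as `tidied`:
# 2 gathered columns x 2 recipes, 6 observations per combination
toy_long <- expand.grid(
  obs    = 1:6,
  column = c("A", "B"),
  Recipe = c("r1", "r2"),
  stringsAsFactors = FALSE
)
toy_long$Step  <- "s1"
toy_long$Stage <- "a"
toy_long$Prod  <- "p1"
toy_long$X     <- rnorm(nrow(toy_long))
# Column A depends on X, column B is pure noise
toy_long$value <- ifelse(toy_long$column == "A", 2 * toy_long$X, 0) +
  rnorm(nrow(toy_long))

grouped <- toy_long %>%
  group_by(column, Recipe, Step, Stage, Prod)

# One row of model-level statistics (including r.squared) per group
toy_glances <- grouped %>%
  group_modify(~ glance(lm(value ~ X, data = .x))) %>%
  ungroup()

# One row per coefficient (intercept and X) per group
toy_coefs <- grouped %>%
  group_modify(~ tidy(lm(value ~ X, data = .x))) %>%
  ungroup()

print(toy_glances[, c("column", "Recipe", "Step", "Stage", "Prod", "r.squared")])
```

Selecting `column` and `r.squared` from the result gives the "R^2 value with the column name" that the question asks for, one line per combination of the grouping factors.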

A similar use case is described in the "broom and dplyr" vignette, and in Section 3.1 of the broom manuscript.
