Fitting several regression models with dplyr


Problem description

I would like to fit a model for each hour (the factor variable) using dplyr. I'm getting an error, and I'm not quite sure what's wrong.

    df.h <- data.frame(
      hour     = factor(rep(1:24, each = 21)),
      price    = runif(504, min = -10, max = 125),
      wind     = runif(504, min = 0, max = 2500),
      temp     = runif(504, min = -10, max = 25)
    )

    df.h <- tbl_df(df.h)
    df.h <- group_by(df.h, hour)

    group_size(df.h) # checks out, 21 obs. for each factor variable

    # different attempts:
    reg.models <- do(df.h, formula = price ~ wind + temp)

    reg.models <- do(df.h, .f = lm(price ~ wind + temp, data = df.h))

I've tried various variations, but I can't get it to work.

Solution

The easiest way to do this, circa May 2015, is to use broom. The broom package contains three functions that deal with the complex objects returned by statistical operations run by group: tidy (which deals with the coefficient vectors), glance (which deals with the model-level summary statistics), and augment (which deals with the observation-level results).
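
Before moving to groups, it may help to see what the three functions return for a single model. The following is a minimal sketch, not part of the original answer: the data frame `toy`, the seed, and the object names `coefs`, `summ` and `per_obs` are illustrative assumptions, and the data are random, so the numbers themselves carry no meaning.

```r
library(broom)

set.seed(42)  # arbitrary seed, only so the toy data are reproducible

# Hypothetical data for one "hour": 21 observations, as in the question
toy <- data.frame(
  price = runif(21, min = -10, max = 125),
  wind  = runif(21, min = 0, max = 2500),
  temp  = runif(21, min = -10, max = 25)
)

fit <- lm(price ~ wind + temp, data = toy)

coefs   <- tidy(fit)     # one row per model term (intercept, wind, temp)
summ    <- glance(fit)   # a single row of model-level statistics
per_obs <- augment(fit)  # one row per observation, with .fitted and .resid

print(coefs)
print(summ)
print(head(per_obs))
```

Each call returns a data frame, which is what makes the per-group results easy to stack into one table.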

Here is a demonstration of its use to extract the various results of linear regression by groups into tidy data_frames.

  1. tidy:

    library(dplyr)
    library(broom)
    
    df.h = data.frame( 
      hour     = factor(rep(1:24, each = 21)),
      price    = runif(504, min = -10, max = 125),
      wind     = runif(504, min = 0, max = 2500),
      temp     = runif(504, min = - 10, max = 25)  
    )
    
    dfHour = df.h %>% group_by(hour) %>%
      do(fitHour = lm(price ~ wind + temp, data = .))
    
    # get the coefficients by group in a tidy data_frame
    dfHourCoef = tidy(dfHour, fitHour)
    dfHourCoef
    

    which gives,

    Source: local data frame [72 x 6]
    Groups: hour
    
       hour        term     estimate   std.error  statistic     p.value
    1     1 (Intercept) 53.336069324 21.33190104  2.5002961 0.022294293
    2     1        wind -0.008475175  0.01338668 -0.6331053 0.534626575
    3     1        temp  1.180019541  0.79178607  1.4903262 0.153453756
    4     2 (Intercept) 77.737788772 23.52048754  3.3051096 0.003936651
    5     2        wind -0.008437212  0.01432521 -0.5889765 0.563196358
    6     2        temp -0.731265113  1.00109489 -0.7304653 0.474506855
    7     3 (Intercept) 38.292039924 17.55361626  2.1814331 0.042655670
    8     3        wind  0.005422492  0.01407478  0.3852630 0.704557388
    9     3        temp  0.426765270  0.83672863  0.5100402 0.616220435
    10    4 (Intercept) 30.603119492 21.05059583  1.4537888 0.163219027
    ..  ...         ...          ...         ...        ...         ...
    

  2. augment:

    # get the predictions by group in a tidy data_frame
    dfHourPred = augment(dfHour, fitHour)
    dfHourPred
    

    which gives,

    Source: local data frame [504 x 11]
    Groups: hour
    
       hour       price      wind      temp  .fitted  .se.fit     .resid       .hat   .sigma      .cooksd .std.resid
    1     1  83.8414055   67.3780 -6.199231 45.44982 22.42649  38.391590 0.27955950 42.24400 0.1470891067  1.0663820
    2     1   0.3061628 2073.7540 15.134085 53.61916 14.10041 -53.312993 0.11051343 41.43590 0.0735584714 -1.3327207
    3     1  80.3790032  520.5949 24.711938 78.08451 20.03558   2.294497 0.22312869 43.64059 0.0003606305  0.0613746
    4     1 121.9023855 1618.0864 12.382588 54.23420 10.31293  67.668187 0.05911743 40.23212 0.0566557575  1.6447224
    5     1  -0.4039594 1542.8150 -5.544927 33.71732 14.53349 -34.121278 0.11740628 42.74697 0.0325125137 -0.8562896
    6     1  29.8269832  396.6951  6.134694 57.21307 16.04995 -27.386085 0.14318542 43.05124 0.0271028701 -0.6975290
    7     1  -7.1865483 2009.9552 -5.657871 29.62495 16.93769 -36.811497 0.15946292 42.54487 0.0566686969 -0.9466312
    8     1  -7.8548693 2447.7092 22.043029 58.60251 19.94686 -66.457379 0.22115706 39.63999 0.2983443034 -1.7753911
    9     1  94.8736726 1525.3144 24.484066 69.30044 15.93352  25.573234 0.14111563 43.12898 0.0231796755  0.6505701
    10    1  54.4643001 2473.2234 -7.656520 23.34022 21.83043  31.124076 0.26489650 42.74790 0.0879837510  0.8558507
    ..  ...         ...       ...       ...      ...      ...        ...        ...      ...          ...        ...
    

  3. glance:

    # get the summary statistics by group in a tidy data_frame
    dfHourSumm = glance(dfHour, fitHour)
    dfHourSumm
    

    which gives,

    Source: local data frame [24 x 12]
    Groups: hour
    
       hour  r.squared adj.r.squared    sigma statistic    p.value df    logLik      AIC      BIC deviance df.residual
    1     1 0.12364561    0.02627290 42.41546 1.2698179 0.30487225  3 -106.8769 221.7538 225.9319 32383.29          18
    2     2 0.03506944   -0.07214506 36.79189 0.3270961 0.72521125  3 -103.8900 215.7799 219.9580 24365.58          18
    3     3 0.02805424   -0.07993974 39.33621 0.2597760 0.77406651  3 -105.2942 218.5884 222.7665 27852.07          18
    4     4 0.17640603    0.08489559 41.37115 1.9277147 0.17434859  3 -106.3534 220.7068 224.8849 30808.30          18
    5     5 0.12575453    0.02861615 42.27865 1.2945915 0.29833246  3 -106.8091 221.6181 225.7962 32174.72          18
    6     6 0.08114417   -0.02095092 35.80062 0.7947901 0.46690268  3 -103.3164 214.6328 218.8109 23070.31          18
    7     7 0.21339168    0.12599076 32.77309 2.4415266 0.11529934  3 -101.4609 210.9218 215.0999 19333.36          18
    8     8 0.21655629    0.12950699 40.92788 2.4877430 0.11119114  3 -106.1272 220.2543 224.4324 30151.65          18
    9     9 0.23388711    0.14876346 35.48431 2.7476160 0.09091487  3 -103.1300 214.2601 218.4381 22664.45          18
    10   10 0.18326177    0.09251307 40.77241 2.0194425 0.16171339  3 -106.0472 220.0945 224.2726 29923.01          18
    ..  ...        ...           ...      ...       ...        ... ..       ...      ...      ...      ...         ...
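
The `tidy(dfHour, fitHour)` form above relies on broom's methods for rowwise data frames produced by `do()`. As far as I know, later broom releases deprecated and then dropped those methods, so the calls may fail on a current installation. The sketch below shows one possible alternative, assuming a dplyr version that provides `group_modify()`. The helper `by_hour()`, the result names and the seed are hypothetical and were introduced only for this illustration.

```r
library(dplyr)
library(broom)

set.seed(1)  # arbitrary seed, only so the simulated data are reproducible

df.h <- data.frame(
  hour  = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind  = runif(504, min = 0, max = 2500),
  temp  = runif(504, min = -10, max = 25)
)

# Hypothetical helper: fit one lm per hour and apply a broom extractor to it.
# Inside group_modify(), .x is the data for the current group only.
by_hour <- function(extract) {
  df.h %>%
    group_by(hour) %>%
    group_modify(~ extract(lm(price ~ wind + temp, data = .x))) %>%
    ungroup()
}

coef_by_hour <- by_hour(tidy)     # 3 coefficient rows per hour
pred_by_hour <- by_hour(augment)  # 1 row per observation
summ_by_hour <- by_hour(glance)   # 1 summary row per hour

print(coef_by_hour)
print(summ_by_hour)
```

This refits the models once per extractor, which is harmless for 24 small regressions but wasteful for expensive fits. The `df.residual` column is 18 for every hour, as in the output above, because each group has 21 observations and 3 estimated coefficients.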
    
