Weekday as dummy / factor variable in a linear regression model using statsmodels


Problem description

The question:

How can I add a dummy / factor variable to a model using sm.OLS()?

The details:

Data sample structure:

Date    A   B   weekday
2013-05-04  25.03   88.51   Saturday
2013-05-05  52.98   67.99   Sunday
2013-05-06  39.93   75.19   Monday
2013-05-07  47.31   86.99   Tuesday
2013-05-08  19.61   87.94   Wednesday
2013-05-09  39.51   83.10   Thursday
2013-05-10  21.22   62.16   Friday
2013-05-11  19.04   58.79   Saturday
2013-05-12  18.53   75.27   Sunday
2013-05-13  11.90   75.43   Monday
2013-05-14  47.64   64.76   Tuesday
2013-05-15  27.47   91.65   Wednesday
2013-05-16  11.20   59.83   Thursday
2013-05-17  25.10   67.47   Friday
2013-05-18  19.89   64.70   Saturday
2013-05-19  38.91   76.68   Sunday
2013-05-20  42.11   94.36   Monday
2013-05-21  7.845   73.67   Tuesday
2013-05-22  35.45   76.67   Wednesday
2013-05-23  29.43   79.05   Thursday
2013-05-24  33.51   78.53   Friday
2013-05-25  13.58   59.26   Saturday
2013-05-26  37.38   68.59   Sunday
2013-05-27  37.09   67.79   Monday
2013-05-28  21.70   70.54   Tuesday
2013-05-29  11.85   60.00   Wednesday

The following fits a linear regression of A on B using sm.OLS() (with a constant term added via sm.add_constant())

Complete code with data sample for regression analysis using statsmodels:

# imports
import pandas as pd
import statsmodels.api as sm

# same data as described above
data = {'Date': {0: '2013-05-04',
          1: '2013-05-05',
          2: '2013-05-06',
          3: '2013-05-07',
          4: '2013-05-08',
          5: '2013-05-09',
          6: '2013-05-10',
          7: '2013-05-11',
          8: '2013-05-12',
          9: '2013-05-13',
          10: '2013-05-14',
          11: '2013-05-15',
          12: '2013-05-16',
          13: '2013-05-17',
          14: '2013-05-18',
          15: '2013-05-19',
          16: '2013-05-20',
          17: '2013-05-21',
          18: '2013-05-22',
          19: '2013-05-23',
          20: '2013-05-24',
          21: '2013-05-25',
          22: '2013-05-26',
          23: '2013-05-27',
          24: '2013-05-28',
          25: '2013-05-29'},
         'A': {0: 25.03,
          1: 52.98,
          2: 39.93,
          3: 47.31,
          4: 19.61,
          5: 39.51,
          6: 21.22,
          7: 19.04,
          8: 18.53,
          9: 11.9,
          10: 47.64,
          11: 27.47,
          12: 11.2,
          13: 25.1,
          14: 19.89,
          15: 38.91,
          16: 42.11,
          17: 7.845,
          18: 35.45,
          19: 29.43,
          20: 33.51,
          21: 13.58,
          22: 37.38,
          23: 37.09,
          24: 21.7,
          25: 11.85},
         'B': {0: 88.51,
          1: 67.99,
          2: 75.19,
          3: 86.99,
          4: 87.94,
          5: 83.1,
          6: 62.16,
          7: 58.79,
          8: 75.27,
          9: 75.43,
          10: 64.76,
          11: 91.65,
          12: 59.83,
          13: 67.47,
          14: 64.7,
          15: 76.68,
          16: 94.36,
          17: 73.67,
          18: 76.67,
          19: 79.05,
          20: 78.53,
          21: 59.26,
          22: 68.59,
          23: 67.79,
          24: 70.54,
          25: 60.0},
         'weekday': {0: 'Saturday',
          1: 'Sunday',
          2: 'Monday',
          3: 'Tuesday',
          4: 'Wednesday',
          5: 'Thursday',
          6: 'Friday',
          7: 'Saturday',
          8: 'Sunday',
          9: 'Monday',
          10: 'Tuesday',
          11: 'Wednesday',
          12: 'Thursday',
          13: 'Friday',
          14: 'Saturday',
          15: 'Sunday',
          16: 'Monday',
          17: 'Tuesday',
          18: 'Wednesday',
          19: 'Thursday',
          20: 'Friday',
          21: 'Saturday',
          22: 'Sunday',
          23: 'Monday',
          24: 'Tuesday',
          25: 'Wednesday'}}

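# build the frame and use the date as index
df = pd.DataFrame(data)
df = df.set_index(['Date'])

# weekday is kept as a plain object (string) column
df['weekday'] = df['weekday'].astype(object)

# regressors: B plus a constant term
independent = df['B'].to_frame()
x = sm.add_constant(independent)

# A is the dependent variable
model = sm.OLS(df['A'], x).fit()
print(model.summary())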

Output (shortened):

                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -1.4328     17.355     -0.083      0.935       -37.252    34.386
B              0.4034      0.233      1.729      0.097        -0.078     0.885
==============================================================================

Now I'd like to add weekday as an explanatory factor variable. I was hoping it would be as easy as changing the column's data type in the dataframe, but that doesn't work, even though the x = sm.add_constant(independent) step accepts the column.

import pandas as pd
import statsmodels.api as sm

df = pd.read_clipboard(sep='\\s+')
df = df.set_index(['Date'])

df['weekday'] =  df['weekday'].astype(object)

independent = df[['B', 'weekday']]
x = sm.add_constant(independent)

model = sm.OLS(df['A'], x).fit()
model.summary()

When execution reaches model = sm.OLS(df['A'], x).fit(), a ValueError is raised:

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
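The message points at the dtype of the design matrix, and that can be checked in isolation. The sketch below is only an illustration on a small hypothetical three-row frame with the same column types as the data above: once a string column sits next to a numeric one, the array that np.asarray() produces has dtype object, which is what the error complains about.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with the same column types as the sample data.
df = pd.DataFrame({
    "B": [88.51, 67.99, 75.19],
    "weekday": ["Saturday", "Sunday", "Monday"],
})

# A purely numeric selection converts to a float array.
numeric_only = np.asarray(df[["B"]])

# Adding the string column forces a common dtype of object.
mixed = np.asarray(df[["B", "weekday"]])

print(numeric_only.dtype)  # float64
print(mixed.dtype)         # object
```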

Any other suggestions?

Solution

You can use a pandas categorical to create the dummy variables or, more simply, use the formula interface, where patsy converts all non-numeric columns to dummy variables or to another factor encoding.

Using the formula interface in this case (equivalent to the lowercase ols in statsmodels.formula.api) gives the result below. Patsy sorts the levels of the categorical variable alphabetically. 'Friday' is missing from the list of variables because it has been selected as the reference category.

>>> res = sm.OLS.from_formula('A ~ B + weekday', df).fit()
>>> print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      A   R-squared:                       0.301
Model:                            OLS   Adj. R-squared:                  0.029
Method:                 Least Squares   F-statistic:                     1.105
Date:                Thu, 03 May 2018   Prob (F-statistic):              0.401
Time:                        15:26:02   Log-Likelihood:                -97.898
No. Observations:                  26   AIC:                             211.8
Df Residuals:                      18   BIC:                             221.9
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               -1.4717     19.343     -0.076      0.940     -42.110      39.167
weekday[T.Monday]        2.5837      9.857      0.262      0.796     -18.124      23.291
weekday[T.Saturday]     -6.5889      9.599     -0.686      0.501     -26.755      13.577
weekday[T.Sunday]        9.2287      9.616      0.960      0.350     -10.975      29.432
weekday[T.Thursday]     -1.7610     10.321     -0.171      0.866     -23.445      19.923
weekday[T.Tuesday]       2.6507      9.664      0.274      0.787     -17.652      22.953
weekday[T.Wednesday]    -6.9320      9.911     -0.699      0.493     -27.754      13.890
B                        0.4047      0.258      1.566      0.135      -0.138       0.948
==============================================================================
Omnibus:                        1.039   Durbin-Watson:                   2.313
Prob(Omnibus):                  0.595   Jarque-Bera (JB):                0.532
Skew:                          -0.350   Prob(JB):                        0.766
Kurtosis:                       3.007   Cond. No.                         638.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

See patsy documentation for options for categorical encodings http://patsy.readthedocs.io/en/latest/categorical-coding.html

For example, the reference category can be specified explicitly, as in this formula:

"A ~ B + C(weekday, Treatment('Sunday'))"

http://patsy.readthedocs.io/en/latest/API-reference.html#patsy.Treatment
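As a sketch of how that formula could be used, the block below fits it on the question's data (weekday names again regenerated from the dates, an assumption made for brevity) and lists the resulting parameter names. With 'Sunday' as the reference level, no Sunday dummy should appear, while Friday now gets its own coefficient.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same 26 observations as in the question; weekday names derived from the dates.
dates = pd.date_range("2013-05-04", periods=26)
df = pd.DataFrame({
    "A": [25.03, 52.98, 39.93, 47.31, 19.61, 39.51, 21.22, 19.04, 18.53,
          11.9, 47.64, 27.47, 11.2, 25.1, 19.89, 38.91, 42.11, 7.845,
          35.45, 29.43, 33.51, 13.58, 37.38, 37.09, 21.7, 11.85],
    "B": [88.51, 67.99, 75.19, 86.99, 87.94, 83.1, 62.16, 58.79, 75.27,
          75.43, 64.76, 91.65, 59.83, 67.47, 64.7, 76.68, 94.36, 73.67,
          76.67, 79.05, 78.53, 59.26, 68.59, 67.79, 70.54, 60.0],
    "weekday": list(dates.day_name()),
})

# Treatment coding with an explicit reference level instead of the
# alphabetically first one.
formula = "A ~ B + C(weekday, Treatment('Sunday'))"
res = smf.ols(formula, data=df).fit()

# Every weekday coefficient is now a contrast against Sunday.
names = res.params.index.tolist()
print(names)
```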
