Weekday as dummy / factor variable in a linear regression model using statsmodels
The question:
How can I add a dummy / factor variable to a model using sm.OLS()?
The details:
Data sample structure:
Date A B weekday
2013-05-04 25.03 88.51 Saturday
2013-05-05 52.98 67.99 Sunday
2013-05-06 39.93 75.19 Monday
2013-05-07 47.31 86.99 Tuesday
2013-05-08 19.61 87.94 Wednesday
2013-05-09 39.51 83.10 Thursday
2013-05-10 21.22 62.16 Friday
2013-05-11 19.04 58.79 Saturday
2013-05-12 18.53 75.27 Sunday
2013-05-13 11.90 75.43 Monday
2013-05-14 47.64 64.76 Tuesday
2013-05-15 27.47 91.65 Wednesday
2013-05-16 11.20 59.83 Thursday
2013-05-17 25.10 67.47 Friday
2013-05-18 19.89 64.70 Saturday
2013-05-19 38.91 76.68 Sunday
2013-05-20 42.11 94.36 Monday
2013-05-21 7.845 73.67 Tuesday
2013-05-22 35.45 76.67 Wednesday
2013-05-23 29.43 79.05 Thursday
2013-05-24 33.51 78.53 Friday
2013-05-25 13.58 59.26 Saturday
2013-05-26 37.38 68.59 Sunday
2013-05-27 37.09 67.79 Monday
2013-05-28 21.70 70.54 Tuesday
2013-05-29 11.85 60.00 Wednesday
The following fits a linear regression of A on B using sm.OLS() (including a constant term added with sm.add_constant()).
Complete code with data sample for regression analysis using statsmodels:
# imports
import pandas as pd
import statsmodels.api as sm
# same data as described above
data = {'Date': {0: '2013-05-04',
1: '2013-05-05',
2: '2013-05-06',
3: '2013-05-07',
4: '2013-05-08',
5: '2013-05-09',
6: '2013-05-10',
7: '2013-05-11',
8: '2013-05-12',
9: '2013-05-13',
10: '2013-05-14',
11: '2013-05-15',
12: '2013-05-16',
13: '2013-05-17',
14: '2013-05-18',
15: '2013-05-19',
16: '2013-05-20',
17: '2013-05-21',
18: '2013-05-22',
19: '2013-05-23',
20: '2013-05-24',
21: '2013-05-25',
22: '2013-05-26',
23: '2013-05-27',
24: '2013-05-28',
25: '2013-05-29'},
'A': {0: 25.03,
1: 52.98,
2: 39.93,
3: 47.31,
4: 19.61,
5: 39.51,
6: 21.22,
7: 19.04,
8: 18.53,
9: 11.9,
10: 47.64,
11: 27.47,
12: 11.2,
13: 25.1,
14: 19.89,
15: 38.91,
16: 42.11,
17: 7.845,
18: 35.45,
19: 29.43,
20: 33.51,
21: 13.58,
22: 37.38,
23: 37.09,
24: 21.7,
25: 11.85},
'B': {0: 88.51,
1: 67.99,
2: 75.19,
3: 86.99,
4: 87.94,
5: 83.1,
6: 62.16,
7: 58.79,
8: 75.27,
9: 75.43,
10: 64.76,
11: 91.65,
12: 59.83,
13: 67.47,
14: 64.7,
15: 76.68,
16: 94.36,
17: 73.67,
18: 76.67,
19: 79.05,
20: 78.53,
21: 59.26,
22: 68.59,
23: 67.79,
24: 70.54,
25: 60.0},
'weekday': {0: 'Saturday',
1: 'Sunday',
2: 'Monday',
3: 'Tuesday',
4: 'Wednesday',
5: 'Thursday',
6: 'Friday',
7: 'Saturday',
8: 'Sunday',
9: 'Monday',
10: 'Tuesday',
11: 'Wednesday',
12: 'Thursday',
13: 'Friday',
14: 'Saturday',
15: 'Sunday',
16: 'Monday',
17: 'Tuesday',
18: 'Wednesday',
19: 'Thursday',
20: 'Friday',
21: 'Saturday',
22: 'Sunday',
23: 'Monday',
24: 'Tuesday',
25: 'Wednesday'}}
df = pd.DataFrame(data)
df = df.set_index(['Date'])
df['weekday'] = df['weekday'].astype(object)
independent = df['B'].to_frame()
x = sm.add_constant(independent)
model = sm.OLS(df['A'], x).fit()
model.summary()
Output (shortened):
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -1.4328     17.355     -0.083      0.935       -37.252    34.386
B              0.4034      0.233      1.729      0.097        -0.078     0.885
==============================================================================
Now I'd like to add weekday as an explanatory factor variable. I was hoping it would be as easy as changing the data type in the dataframe, but unfortunately that doesn't seem to work, although the column was accepted by the x = sm.add_constant(independent) part.
import pandas as pd
import statsmodels.api as sm
df = pd.read_clipboard(sep='\\s+')
df = df.set_index(['Date'])
df['weekday'] = df['weekday'].astype(object)
independent = df[['B', 'weekday']]
x = sm.add_constant(independent)
model = sm.OLS(df['A'], x).fit()
model.summary()
When you reach the model = sm.OLS(df['A'], x).fit() part, a ValueError is raised:
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
Any other suggestions?
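The error message itself suggests inspecting the input with np.asarray. A minimal sketch of that check, on a two-row frame made up purely for illustration, shows what goes wrong: once the string column is included, the whole design matrix is converted to dtype object, which OLS cannot use.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (hypothetical subset of the data above)
df = pd.DataFrame({"B": [88.51, 67.99], "weekday": ["Saturday", "Sunday"]})

# Numeric columns only: converts to a float array
numeric_only = np.asarray(df[["B"]])
print(numeric_only.dtype)

# Numeric plus string column: the common dtype falls back to object
mixed = np.asarray(df[["B", "weekday"]])
print(mixed.dtype)
```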
You can use a pandas categorical to create the dummy variables or, more simply, use the formula interface, where patsy converts every non-numeric column into dummy variables (or another factor encoding).
Using the formula interface in this case (sm.OLS.from_formula is equivalent to the lowercase ols in statsmodels.formula.api) gives the result below.
Patsy sorts the levels of a categorical variable alphabetically. 'Friday' does not appear in the list of coefficients because it has been selected as the reference category.
>>> res = sm.OLS.from_formula('A ~ B + weekday', df).fit()
>>> print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      A   R-squared:                       0.301
Model:                            OLS   Adj. R-squared:                  0.029
Method:                 Least Squares   F-statistic:                     1.105
Date:                Thu, 03 May 2018   Prob (F-statistic):              0.401
Time:                        15:26:02   Log-Likelihood:                -97.898
No. Observations:                  26   AIC:                             211.8
Df Residuals:                      18   BIC:                             221.9
Df Model:                           7
Covariance Type:            nonrobust
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               -1.4717     19.343     -0.076      0.940     -42.110      39.167
weekday[T.Monday]        2.5837      9.857      0.262      0.796     -18.124      23.291
weekday[T.Saturday]     -6.5889      9.599     -0.686      0.501     -26.755      13.577
weekday[T.Sunday]        9.2287      9.616      0.960      0.350     -10.975      29.432
weekday[T.Thursday]     -1.7610     10.321     -0.171      0.866     -23.445      19.923
weekday[T.Tuesday]       2.6507      9.664      0.274      0.787     -17.652      22.953
weekday[T.Wednesday]    -6.9320      9.911     -0.699      0.493     -27.754      13.890
B                        0.4047      0.258      1.566      0.135      -0.138       0.948
==============================================================================
Omnibus:                        1.039   Durbin-Watson:                   2.313
Prob(Omnibus):                  0.595   Jarque-Bera (JB):                0.532
Skew:                          -0.350   Prob(JB):                        0.766
Kurtosis:                       3.007   Cond. No.                         638.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
See the patsy documentation for the available categorical encodings: http://patsy.readthedocs.io/en/latest/categorical-coding.html
For example, the reference category can be specified explicitly, as in this formula:
"A ~ B + C(weekday, Treatment('Sunday'))"
http://patsy.readthedocs.io/en/latest/API-reference.html#patsy.Treatment
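As a rough sketch of how that formula might be used, the snippet below fits the model with 'Sunday' as the reference category via the lowercase ols from statsmodels.formula.api. The 14-row frame is a subset of the question's data chosen for illustration; with the full data the coefficients will differ.

```python
import pandas as pd
import statsmodels.formula.api as smf

# First 14 rows of the question's data (two full weeks), used for illustration
df = pd.DataFrame({
    "A": [25.03, 52.98, 39.93, 47.31, 19.61, 39.51, 21.22,
          19.04, 18.53, 11.90, 47.64, 27.47, 11.20, 25.10],
    "B": [88.51, 67.99, 75.19, 86.99, 87.94, 83.10, 62.16,
          58.79, 75.27, 75.43, 64.76, 91.65, 59.83, 67.47],
    "weekday": ["Saturday", "Sunday", "Monday", "Tuesday",
                "Wednesday", "Thursday", "Friday"] * 2,
})

# Treatment('Sunday') makes Sunday the omitted reference level, so every
# weekday coefficient is a difference relative to Sunday
res = smf.ols("A ~ B + C(weekday, Treatment('Sunday'))", data=df).fit()

# Sunday is absent from the parameter names; Friday now gets its own term
names = list(res.params.index)
print(names)
```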