Multiple linear regression in pandas statsmodels: ValueError
Question
Data: https://courses.edx.org/c4x/MITx/15.071x_2/asset/NBA_train.csv
I know how to fit these data to a multiple linear regression model using statsmodels.formula.api:
import pandas as pd
NBA = pd.read_csv("NBA_train.csv")
import statsmodels.formula.api as smf
model = smf.ols(formula="W ~ PTS + oppPTS", data=NBA).fit()
model.summary()
However, I find this R-like formula notation awkward, and I'd like to use the usual pandas syntax instead:
import pandas as pd
NBA = pd.read_csv("NBA_train.csv")
import statsmodels.api as sm
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()
With the second method I get the following error:
ValueError: shapes (835,2) and (835,2) not aligned: 2 (dim 1) != 835 (dim 0)
Why does this happen, and how can I fix it?
Recommended answer
When using sm.OLS(y, X), y is the dependent variable and X holds the independent variables.
In the formula W ~ PTS + oppPTS, W is the dependent variable and PTS and oppPTS are the independent variables.
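In the question's second snippet these roles are swapped, which accounts for the shapes in the error message: y becomes the two columns PTS and oppPTS, and X, after add_constant, also has two columns (the constant and W), each with 835 rows. The message is the one NumPy raises when two arrays of that shape are multiplied as matrices. The sketch below reproduces only the message, using zero-filled placeholder arrays instead of the NBA data; it does not claim to show the exact statsmodels call that fails.

```python
import numpy as np

# Placeholder arrays with the shapes from the error message (not the real data).
swapped_y = np.zeros((835, 2))  # stands in for NBA[['PTS', 'oppPTS']]
swapped_X = np.zeros((835, 2))  # stands in for add_constant(NBA['W'])

try:
    # A matrix product needs the inner dimensions to match (2 vs. 835 here).
    np.dot(swapped_y, swapped_X)
except ValueError as exc:
    message = str(exc)
    print(message)
```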
Therefore, use
y = NBA['W']
X = NBA[['PTS', 'oppPTS']]
instead of
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
import pandas as pd
import statsmodels.api as sm
NBA = pd.read_csv("NBA_train.csv")
y = NBA['W']
X = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()
yields
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.942
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     6799.
Date:                Sat, 21 Mar 2015   Prob (F-statistic):               0.00
Time:                        14:58:05   Log-Likelihood:                -2118.0
No. Observations:                 835   AIC:                             4242.
Df Residuals:                     832   BIC:                             4256.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         41.3048      1.610     25.652      0.000        38.144    44.465
PTS            0.0326      0.000    109.600      0.000         0.032     0.033
oppPTS        -0.0326      0.000   -110.951      0.000        -0.033    -0.032
==============================================================================
Omnibus:                        1.026   Durbin-Watson:                   2.238
Prob(Omnibus):                  0.599   Jarque-Bera (JB):                0.984
Skew:                           0.084   Prob(JB):                        0.612
Kurtosis:                       3.009   Cond. No.                     1.80e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.