Run an OLS regression with Pandas Data Frame
Question
I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:
import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50],
"B": [20, 30, 10, 40, 50],
"C": [32, 234, 23, 23, 42523]})
Ideally, I would have something like ols(A ~ B + C, data = df), but when I look at the examples from algorithm libraries like scikit-learn, they appear to feed the data to the model as a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most Pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?
Recommended answer
I think you can do almost exactly what you thought would be ideal, using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).
>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept 14.952480
B 0.401182
C 0.000352
dtype: float64
>>> print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
Time:                        20:04:30   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
B              0.4012      0.650      0.617      0.600        -2.394     3.197
C              0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================

Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
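
On the more general part of the question (other machine learning libraries), here is a minimal sketch, assuming scikit-learn is installed: its estimators accept DataFrame columns directly, so no conversion to lists of lists should be needed. The new_rows frame below is made-up data used only to illustrate prediction; it is not part of the original question.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same toy data as in the question.
df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# Select predictor columns as a DataFrame and the target as a Series;
# scikit-learn treats each DataFrame row as one observation.
X = df[["B", "C"]]
y = df["A"]

model = LinearRegression().fit(X, y)

# Label the fitted coefficients with the column names for readability.
coefs = pd.Series(model.coef_, index=X.columns)
print("Intercept:", model.intercept_)
print(coefs)

# Hypothetical new observations (illustrative values only).
new_rows = pd.DataFrame({"B": [25, 45], "C": [100, 1000]})
predictions = model.predict(new_rows)
print(predictions)
```

The intercept and coefficients should agree with the statsmodels result above up to rounding, since both fit ordinary least squares with an intercept. Unlike statsmodels, scikit-learn does not produce a summary table with standard errors or p-values.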