Run an OLS regression with Pandas Data Frame

Problem description

I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50], 
                   "B": [20, 30, 10, 40, 50], 
                   "C": [32, 234, 23, 23, 42523]})

Ideally, I would have something like ols(A ~ B + C, data = df) but when I look at the examples from algorithm libraries like scikit-learn it appears to feed the data to the model with a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?

Solution

I think you can do almost exactly what you described as ideal by using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).

>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64
>>> print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
Time:                        20:04:30   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
B              0.4012      0.650      0.617      0.600        -2.394     3.197
C              0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================

Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
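
As a cross-check on the coefficients above, here is a minimal sketch that solves the same least-squares problem with NumPy directly from the data frame columns, with no manual reshaping into lists of lists. This is not how statsmodels is implemented; it only assumes that numpy is installed alongside pandas. The names X, y, beta, coefs and new_row are illustrative, and the values in new_row are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# Design matrix: a column of ones for the intercept, then the predictors B and C.
predictors = df[["B", "C"]].to_numpy(dtype=float)
X = np.column_stack([np.ones(len(df)), predictors])
y = df["A"].to_numpy(dtype=float)

# Least-squares solution of X @ beta ~= y.
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
coefs = pd.Series(beta, index=["Intercept", "B", "C"])
print(coefs)

# Predict A for one new observation (hypothetical values: B=25, C=100).
new_row = np.array([1.0, 25.0, 100.0])
prediction = float(new_row @ beta)
print(prediction)
```

The printed coefficients should agree with result.params from the statsmodels fit up to rounding. The formula interface is still the more convenient choice when you also want standard errors, p-values and the summary table.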
