Appending predicted values and residuals to pandas dataframe

Question

It's a useful and common practice to append predicted values and residuals from running a regression onto a dataframe as distinct columns. I'm new to pandas, and I'm having trouble performing this very simple operation. I know I'm missing something obvious. There was a very similar question asked about a year-and-a-half ago, but it wasn't really answered.

The dataframe currently looks something like this:

y               x1           x2   
880.37          3.17         23
716.20          4.76         26
974.79          4.17         73
322.80          8.70         72
1054.25         11.45        16

And all I'm wanting is to return a dataframe that has the predicted value and residual from y = x1 + x2 for each observation:

y               x1           x2       y_hat         res
880.37          3.17         23       840.27        40.10
716.20          4.76         26       752.60        -36.40
974.79          4.17         73       877.49        97.30
322.80          8.70         72       348.50        -25.70
1054.25         11.45        16       815.15        239.10

I've tried resolving this using statsmodels and pandas and haven't been able to solve it. Thanks in advance!

Solution

Here is a variation on Alexander's answer using the OLS model from statsmodels instead of the pandas ols model. We can use either the formula or the array/DataFrame interface to the models.

fittedvalues and resid are pandas Series with the correct index. predict does not return a pandas Series.
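The index matters because pandas aligns a Series by label when it is assigned as a column, while a plain array is assigned by position. The following is a minimal sketch of that behaviour with made-up values standing in for model output; the column names are hypothetical and nothing here calls statsmodels.

```python
import numpy as np
import pandas as pd

# Same index as the example below: 10, 12, ..., 18 (not the default 0..4).
df = pd.DataFrame({'y': [880.37, 716.20, 974.79, 322.80, 1054.25]},
                  index=np.arange(10, 20, 2))

# Stand-in for model output that has lost its index (made-up values).
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# A plain array is assigned by position, so every row gets a value.
df['from_array'] = values

# A Series with the default 0..4 index is aligned by label; none of its
# labels match 10..18, so the whole column ends up NaN.
df['from_default_series'] = pd.Series(values)

# Re-attaching the DataFrame's own index makes label alignment safe.
df['from_indexed_series'] = pd.Series(values, index=df.index)

print(df)
```

So a result that already carries the right index can be assigned directly, and an array works as long as its row order matches the dataframe.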

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]},
                   index=np.arange(10, 20, 2))

result = smf.ols('y ~ x1 + x2', df).fit()
df['yhat'] = result.fittedvalues
df['resid'] = result.resid


result2 = sm.OLS(df['y'], sm.add_constant(df[['x1', 'x2']])).fit()
df['yhat2'] = result2.fittedvalues
df['resid2'] = result2.resid

# predict doesn't return pandas series and no index is available
df['predicted'] = result.predict(df)

print(df)

       x1  x2        y        yhat       resid       yhat2      resid2  \
10   3.17  23   880.37  923.949309  -43.579309  923.949309  -43.579309   
12   4.76  26   716.20  890.732201 -174.532201  890.732201 -174.532201   
14   4.17  73   974.79  656.155079  318.634921  656.155079  318.634921   
16   8.70  72   322.80  610.510952 -287.710952  610.510952 -287.710952   
18  11.45  16  1054.25  867.062458  187.187542  867.062458  187.187542   

     predicted  
10  923.949309  
12  890.732201  
14  656.155079  
16  610.510952  
18  867.062458  
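As a cross-check that does not depend on statsmodels, the same two columns can be computed with plain NumPy least squares. This is only a sketch that swaps in np.linalg.lstsq for the OLS model; the column names yhat_np and resid_np are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]},
                  index=np.arange(10, 20, 2))

# Design matrix with an explicit intercept column (the role sm.add_constant plays above).
X = np.column_stack([np.ones(len(df)),
                     df['x1'].to_numpy(),
                     df['x2'].to_numpy()])
y = df['y'].to_numpy()

# Least-squares coefficients: intercept, x1, x2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# X @ beta is a plain array in row order, so it is assigned by position.
df['yhat_np'] = X @ beta
# The residual is simply observed minus fitted.
df['resid_np'] = df['y'] - df['yhat_np']

print(df)
```

If the setup matches the model above, these columns should agree with yhat and resid up to floating-point error.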

As a preview, there is an extended prediction method in the model results in statsmodels master (0.7), but the API is not yet settled:

>>> print(result.get_prediction().summary_frame())
          mean     mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  \
10  923.949309  268.931939    -233.171432    2081.070051   -991.466820   
12  890.732201  211.945165     -21.194241    1802.658643   -887.328646   
14  656.155079  269.136102    -501.844105    1814.154263  -1259.791854   
16  610.510952  282.182030    -603.620329    1824.642233  -1339.874985   
18  867.062458  329.017262    -548.584564    2282.709481  -1214.750941   

    obs_ci_upper  
10   2839.365439  
12   2668.793048  
14   2572.102012  
16   2560.896890  
18   2948.875858  
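For readers wondering where those columns come from, here is a rough NumPy sketch of the textbook OLS formulas for the standard error of the fitted mean and of a new observation. I am assuming these are what summary_frame reports at its default 95% level; the t quantile is hardcoded for 2 residual degrees of freedom rather than looked up, and all variable names are my own.

```python
import numpy as np

x1 = np.array([3.17, 4.76, 4.17, 8.70, 11.45])
x2 = np.array([23.0, 26.0, 73.0, 72.0, 16.0])
y = np.array([880.37, 716.20, 974.79, 322.80, 1054.25])

# Design matrix with intercept; n observations, k parameters.
X = np.column_stack([np.ones_like(x1), x1, x2])
n, k = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
mean = X @ beta
resid = y - mean

# Residual variance estimate with n - k degrees of freedom.
sigma2 = resid @ resid / (n - k)

# Leverage h_i = x_i' (X'X)^-1 x_i, the diagonal of the hat matrix.
XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum('ij,jk,ik->i', X, XtX_inv, X)

# Standard error of the fitted mean and of a new observation.
mean_se = np.sqrt(sigma2 * leverage)
obs_se = np.sqrt(sigma2 * (1.0 + leverage))

# Assumption: 97.5% quantile of Student's t with n - k = 2 degrees of freedom.
t_crit = 4.302653

mean_ci_lower = mean - t_crit * mean_se
mean_ci_upper = mean + t_crit * mean_se
obs_ci_lower = mean - t_crit * obs_se
obs_ci_upper = mean + t_crit * obs_se

print(np.column_stack([mean, mean_se, mean_ci_lower, mean_ci_upper,
                       obs_ci_lower, obs_ci_upper]))
```

With only five observations and three parameters the intervals are very wide, which is why the bounds in the output above span thousands.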
