Linear regression fails in Python with large values in dependent variables


Question

I'm trying to rewrite a forecasting model (originally in Stata) in Python using pandas.stats.api.ols, and ran into an issue with linear regression: the coefficients and intercept computed by pandas do not match those from Stata.

Investigation suggests that the root cause might be that the values of the regressors are very large. My suspicion is based on the findings below:

1) I created a simple DataFrame in Python and ran a linear regression on it:

from pandas.stats.api import ols
import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50,21], "B": [20, 30, 10, 40, 50,98], "C": [32, 234, 23, 23, 31,21], "D":[12,28,12,98,51,87], "E": [1,8,12,9,12,91]})
ols(y=df['A'], x=df[['B','C', 'D', 'E']])

The summary of the LR is:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>

Number of Observations:         6
Number of Degrees of Freedom:   5

R-squared:         0.4627
Adj R-squared:    -1.6865

Rmse:             23.9493

F-stat (4, 1):     0.2153, p-value:     0.9026

Degrees of Freedom: model 4, resid 1

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.3212     1.1176       0.29     0.8218    -1.8693     2.5117
             C    -0.0488     0.1361      -0.36     0.7806    -0.3155     0.2178
             D     0.1512     0.4893       0.31     0.8092    -0.8077     1.1101
             E    -0.4508     0.8268      -0.55     0.6822    -2.0713     1.1697
     intercept    20.9222    23.6280       0.89     0.5386   -25.3887    67.2331
---------------------------------End of Summary---------------------------------

I saved this DataFrame to a Stata .dta file and ran the LR in Stata as:

 use "/tmp/lr.dta", clear
 reg A B C D E

The coefficients, standard errors, and p-values are the same:

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  4,     1) =    0.22
       Model |  493.929019     4  123.482255           Prob > F      =  0.9026
    Residual |  573.570981     1  573.570981           R-squared     =  0.4627
-------------+------------------------------           Adj R-squared = -1.6865
       Total |      1067.5     5       213.5           Root MSE      =  23.949

------------------------------------------------------------------------------
           A |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           B |   .3211939   1.117591     0.29   0.822    -13.87914    14.52153
           C |  -.0488429   .1360552    -0.36   0.781    -1.777589    1.679903
           D |   .1512067   .4892539     0.31   0.809    -6.065353    6.367766
           E |  -.4508122   .8267897    -0.55   0.682    -10.95617    10.05455
       _cons |    20.9222   23.62799     0.89   0.539    -279.2998    321.1442
------------------------------------------------------------------------------

I tried this in R and got the same result.

2) However, if I increase the values of the regressors in Python:

df = pd.DataFrame({"A": [10.0,20.0,30.0,40.0,50.0,21.0]})
df['B'] = pow(df['A'], 30)
df['C'] = pow(df['A'], 5)
df['D'] = pow(df['A'], 15)
df['E'] = pow(df['A'], 25)

I've made sure all the columns are float64 here:

df.dtypes
A    float64
B    float64
C    float64
D    float64
E    float64
dtype: object

The result I got is:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>

Number of Observations:         6
Number of Degrees of Freedom:   2

R-squared:        -0.7223
Adj R-squared:    -1.1528

Rmse:             21.4390

F-stat (4, 4):    -1.6775, p-value:     1.0000

Degrees of Freedom: model 1, resid 4

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B    -0.0000     0.0000      -0.00     0.9973    -0.0000     0.0000
             C     0.0000     0.0000       0.00     1.0000    -0.0000     0.0000
             D     0.0000     0.0000       0.00     1.0000    -0.0000     0.0000
             E     0.0000     0.0000       0.00     0.9975    -0.0000     0.0000
     intercept     0.0000    21.7485       0.00     1.0000   -42.6271    42.6271
---------------------------------End of Summary---------------------------------

But in Stata, I got a very different result:

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  4,     1) =  237.35
       Model |  1066.37679     4  266.594196           Prob > F      =  0.0486
    Residual |   1.1232144     1   1.1232144           R-squared     =  0.9989
-------------+------------------------------           Adj R-squared =  0.9947
       Total |      1067.5     5       213.5           Root MSE      =  1.0598

------------------------------------------------------------------------------
           A |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           B |  -1.45e-45   2.32e-46    -6.24   0.101    -4.40e-45    1.50e-45
           C |   2.94e-06   3.67e-07     8.01   0.079    -1.72e-06    7.61e-06
           D |  -3.86e-21   6.11e-22    -6.31   0.100    -1.16e-20    3.90e-21
           E |   4.92e-37   7.88e-38     6.24   0.101    -5.09e-37    1.49e-36
       _cons |   9.881129    1.07512     9.19   0.069    -3.779564    23.54182
------------------------------------------------------------------------------

And the result in R aligns with Stata:

lm(formula = A ~ B + C + D + E, data = stata)

Residuals:
         1          2          3          4          5          6 
-1.757e-01  8.211e-01  1.287e-03 -1.269e-06  1.289e-09 -6.467e-01 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  9.881e+00  1.075e+00   9.191    0.069 .
B           -1.449e-45  2.322e-46  -6.238    0.101  
C            2.945e-06  3.674e-07   8.015    0.079 .
D           -3.855e-21  6.106e-22  -6.313    0.100  
E            4.919e-37  7.879e-38   6.243    0.101  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.06 on 1 degrees of freedom
Multiple R-squared:  0.9989,  Adjusted R-squared:  0.9947 
F-statistic: 237.3 on 4 and 1 DF,  p-value: 0.04864

Hence, it appears to me that pandas is having some issue here. Could anyone please advise?

Recommended answer

I think this is caused by the relative precision of floating-point numbers (not just in Python, but in most other programming languages as well, such as C++). np.finfo(float).eps gives 2.2204460492503131e-16, so anything smaller than eps*max_value_of_your_data is essentially treated as 0 when it is added to or subtracted from a value of that size. For example, 1e117 + 1e100 == 1e117 returns True, because 1e100/1e117 = 1e-17 < eps. Now look at your data.

# your data
# =======================
print(df)

    A           B          C           D           E
0  10  1.0000e+30     100000  1.0000e+15  1.0000e+25
1  20  1.0737e+39    3200000  3.2768e+19  3.3554e+32
2  30  2.0589e+44   24300000  1.4349e+22  8.4729e+36
3  40  1.1529e+48  102400000  1.0737e+24  1.1259e+40
4  50  9.3132e+50  312500000  3.0518e+25  2.9802e+42
5  21  4.6407e+39    4084101  6.8122e+19  1.1363e+33
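
The columns above span many orders of magnitude, which is where the absorption effect matters. As a minimal standalone check (standard library only; the variable names are illustrative and not from the original post), the following sketch shows that a term is lost in an addition once it falls below eps relative to the larger operand, while the ratio of the same two numbers is still computed without trouble:

```python
import sys

# Machine epsilon for float64; the same value np.finfo(float).eps reports.
eps = sys.float_info.epsilon

big = 1e117
small = 1e100

# small / big is about 1e-17, which is below eps,
# so adding small leaves big unchanged.
absorbed = (big + small == big)

# A term about 1e-13 times big is above eps and still changes the sum.
visible = (big + 1e104 != big)

# Division is not affected in the same way: the ratio is tiny but nonzero.
ratio = small / big

print(eps, absorbed, visible, ratio)
```

The point of the last line is that the loss happens when values of very different magnitude are combined by addition or subtraction, which is what a least-squares solver does internally when it mixes columns such as B and C.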

After taking relative precision into account:

# ===================================================
import numpy as np

np.finfo(float).eps # 2.2204460492503131e-16

df[df < df.max().max()*np.finfo(float).eps] = 0
df

   A           B  C  D           E
0  0  0.0000e+00  0  0  0.0000e+00
1  0  1.0737e+39  0  0  0.0000e+00
2  0  2.0589e+44  0  0  8.4729e+36
3  0  1.1529e+48  0  0  1.1259e+40
4  0  9.3132e+50  0  0  2.9802e+42
5  0  4.6407e+39  0  0  0.0000e+00

So at this scale there is no variation left in y (A), and that's why statsmodels returns coefficients that are all effectively 0. As a reminder, it is always good practice to normalize your data before running a regression.
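
The normalization advice can be sketched as follows. This is an illustration under stated assumptions, not the original poster's or the answerer's code: it uses numpy.linalg.lstsq instead of pandas.stats.api.ols (which is no longer available in current pandas releases), the helper name fit_with_intercept is hypothetical, and dividing each column by its largest absolute value is only one of several reasonable rescaling choices.

```python
import numpy as np

# Data from the question: y is A, the regressors are powers of A.
A = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 21.0])
X_raw = np.column_stack([A**30, A**5, A**15, A**25])  # columns B, C, D, E


def fit_with_intercept(X, y):
    """Least-squares fit of y on X plus an intercept column (placed last).

    Hypothetical helper for this sketch. Returns the coefficients, the
    numerical rank found by the solver, and R-squared.
    """
    design = np.column_stack([X, np.ones(len(y))])
    coef, _, rank, _ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coef, rank, r2


# Raw columns: singular values that are negligible relative to the largest
# one are discarded, so the solver sees fewer than 5 usable columns.
_, rank_raw, _ = fit_with_intercept(X_raw, A)

# Rescale every regressor to the range [0, 1] before fitting.
scale = np.abs(X_raw).max(axis=0)
coef_scaled, rank_scaled, r2_scaled = fit_with_intercept(X_raw / scale, A)

# Convert the slopes back to the original units; the intercept is unchanged.
coef_original = np.append(coef_scaled[:-1] / scale, coef_scaled[-1])

print("rank (raw):   ", rank_raw)
print("rank (scaled):", rank_scaled)
print("R-squared (scaled):", r2_scaled)
print("coefficients B, C, D, E, intercept:", coef_original)
```

Rescaling a column only changes the units of its coefficient, so the fitted values are mathematically the same; what changes is that the solver no longer has to combine numbers that differ by dozens of orders of magnitude. The reduced rank on the raw data appears consistent with the "Degrees of Freedom: model 1" line in the pandas summary above, although the exact cutoff pandas applied is not confirmed here.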
