Linear regression fails in Python with large values in dependent variables
Problem Description
I'm trying to rewrite a forecasting model (originally in Stata) using Python (with pandas.stats.api.ols), and ran into an issue with linear regression: the coefficients and intercept computed by pandas do not match those from Stata.
Investigation suggests the root cause might be that the values of the regressors (the x variables) are very large. I base this suspicion on the findings below:
1) I created a simple DataFrame in Python and ran linear regression on it:
from pandas.stats.api import ols
import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50,21], "B": [20, 30, 10, 40, 50,98], "C": [32, 234, 23, 23, 31,21], "D":[12,28,12,98,51,87], "E": [1,8,12,9,12,91]})
ols(y=df['A'], x=df[['B','C', 'D', 'E']])
The summary of the regression is:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>
Number of Observations: 6
Number of Degrees of Freedom: 5
R-squared: 0.4627
Adj R-squared: -1.6865
Rmse: 23.9493
F-stat (4, 1): 0.2153, p-value: 0.9026
Degrees of Freedom: model 4, resid 1
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
B 0.3212 1.1176 0.29 0.8218 -1.8693 2.5117
C -0.0488 0.1361 -0.36 0.7806 -0.3155 0.2178
D 0.1512 0.4893 0.31 0.8092 -0.8077 1.1101
E -0.4508 0.8268 -0.55 0.6822 -2.0713 1.1697
intercept 20.9222 23.6280 0.89 0.5386 -25.3887 67.2331
---------------------------------End of Summary---------------------------------
I saved this DataFrame to a Stata .dta file and ran the same regression in Stata:
use "/tmp/lr.dta", clear
reg A B C D E
The results are the same:
Source | SS df MS Number of obs = 6
-------------+------------------------------ F( 4, 1) = 0.22
Model | 493.929019 4 123.482255 Prob > F = 0.9026
Residual | 573.570981 1 573.570981 R-squared = 0.4627
-------------+------------------------------ Adj R-squared = -1.6865
Total | 1067.5 5 213.5 Root MSE = 23.949
------------------------------------------------------------------------------
A | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
B | .3211939 1.117591 0.29 0.822 -13.87914 14.52153
C | -.0488429 .1360552 -0.36 0.781 -1.777589 1.679903
D | .1512067 .4892539 0.31 0.809 -6.065353 6.367766
E | -.4508122 .8267897 -0.55 0.682 -10.95617 10.05455
_cons | 20.9222 23.62799 0.89 0.539 -279.2998 321.1442
------------------------------------------------------------------------------
I tried this in R and got the same result.
2) However, if I increase the values of the regressors in Python:
df = pd.DataFrame({"A": [10.0,20.0,30.0,40.0,50.0,21.0]})
df['B'] = pow(df['A'], 30)
df['C'] = pow(df['A'], 5)
df['D'] = pow(df['A'], 15)
df['E'] = pow(df['A'], 25)
I've made sure all the columns use float64 here:

df.dtypes
A    float64
B    float64
C    float64
D    float64
E    float64
dtype: object
This time the result I got is:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>
Number of Observations: 6
Number of Degrees of Freedom: 2
R-squared: -0.7223
Adj R-squared: -1.1528
Rmse: 21.4390
F-stat (4, 4): -1.6775, p-value: 1.0000
Degrees of Freedom: model 1, resid 4
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
B -0.0000 0.0000 -0.00 0.9973 -0.0000 0.0000
C 0.0000 0.0000 0.00 1.0000 -0.0000 0.0000
D 0.0000 0.0000 0.00 1.0000 -0.0000 0.0000
E 0.0000 0.0000 0.00 0.9975 -0.0000 0.0000
intercept 0.0000 21.7485 0.00 1.0000 -42.6271 42.6271
---------------------------------End of Summary---------------------------------
But in Stata, I got a very different result:
Source | SS df MS Number of obs = 6
-------------+------------------------------ F( 4, 1) = 237.35
Model | 1066.37679 4 266.594196 Prob > F = 0.0486
Residual | 1.1232144 1 1.1232144 R-squared = 0.9989
-------------+------------------------------ Adj R-squared = 0.9947
Total | 1067.5 5 213.5 Root MSE = 1.0598
------------------------------------------------------------------------------
A | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
B | -1.45e-45 2.32e-46 -6.24 0.101 -4.40e-45 1.50e-45
C | 2.94e-06 3.67e-07 8.01 0.079 -1.72e-06 7.61e-06
D | -3.86e-21 6.11e-22 -6.31 0.100 -1.16e-20 3.90e-21
E | 4.92e-37 7.88e-38 6.24 0.101 -5.09e-37 1.49e-36
_cons | 9.881129 1.07512 9.19 0.069 -3.779564 23.54182
------------------------------------------------------------------------------
And the result in R aligns with Stata: lm(formula = A ~ B + C + D + E, data = stata)
Residuals:
1 2 3 4 5 6
-1.757e-01 8.211e-01 1.287e-03 -1.269e-06 1.289e-09 -6.467e-01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.881e+00 1.075e+00 9.191 0.069 .
B -1.449e-45 2.322e-46 -6.238 0.101
C 2.945e-06 3.674e-07 8.015 0.079 .
D -3.855e-21 6.106e-22 -6.313 0.100
E 4.919e-37 7.879e-38 6.243 0.101
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.06 on 1 degrees of freedom
Multiple R-squared: 0.9989, Adjusted R-squared: 0.9947
F-statistic: 237.3 on 4 and 1 DF, p-value: 0.04864
Hence, it appears to me that pandas has some issue here. Could anyone please advise?
Answer
I think this is because of the relative precision limits of floating point in Python (not just in Python, but in most other programming languages as well, such as C++). np.finfo(float).eps gives 2.2204460492503131e-16, so anything smaller than eps * max_value_of_your_data is effectively treated as 0 when it meets the larger values in primitive operations like + - * /. For example, 1e117 + 1e100 == 1e117 returns True, because 1e100/1e117 = 1e-17 < eps. Now look at your data:
# your data
# =======================
print(df)
A B C D E
0 10 1.0000e+30 100000 1.0000e+15 1.0000e+25
1 20 1.0737e+39 3200000 3.2768e+19 3.3554e+32
2 30 2.0589e+44 24300000 1.4349e+22 8.4729e+36
3 40 1.1529e+48 102400000 1.0737e+24 1.1259e+40
4 50 9.3132e+50 312500000 3.0518e+25 2.9802e+42
5 21 4.6407e+39 4084101 6.8122e+19 1.1363e+33
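A related way to quantify the trouble (my addition, not part of the original answer, assuming numpy is available): the column magnitudes span roughly 9.3e50 down to 3e8, so the condition number of this design matrix dwarfs 1/eps (about 4.5e15 for float64), and no double-precision least-squares solve of the raw matrix can retain significant digits:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10.0, 20.0, 30.0, 40.0, 50.0, 21.0]})
for col, p in [("B", 30), ("C", 5), ("D", 15), ("E", 25)]:
    df[col] = df["A"] ** p

X = df[["B", "C", "D", "E"]].to_numpy()
# The ratio of the largest to smallest singular value far exceeds 1/eps,
# so float64 least squares on the raw matrix cannot be meaningful.
print(np.linalg.cond(X) > 1e15)  # True
```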
After taking relative precision into account:
# ===================================================
import numpy as np
np.finfo(float).eps # 2.2204460492503131e-16
df[df < df.max().max()*np.finfo(float).eps] = 0
df
A B C D E
0 0 0.0000e+00 0 0 0.0000e+00
1 0 1.0737e+39 0 0 0.0000e+00
2 0 2.0589e+44 0 0 8.4729e+36
3 0 1.1529e+48 0 0 1.1259e+40
4 0 9.3132e+50 0 0 2.9802e+42
5 0 4.6407e+39 0 0 0.0000e+00
So there is no variation at all left in y (A), and that's why statsmodels returns all-zero coefficients. As a reminder, it is always good practice to normalize your data before running a regression.
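To illustrate that last point (a sketch of mine, not from the original answer): rescaling each regressor by its own maximum before solving brings the condition number back into a range float64 can handle, and the fit that Stata and R report reappears:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10.0, 20.0, 30.0, 40.0, 50.0, 21.0]})
for col, p in [("B", 30), ("C", 5), ("D", 15), ("E", 25)]:
    df[col] = df["A"] ** p

X = df[["B", "C", "D", "E"]].to_numpy()
scale = np.abs(X).max(axis=0)
# Rescale every column into [0, 1] and prepend an intercept column
Xs = np.column_stack([np.ones(len(df)), X / scale])

beta, *_ = np.linalg.lstsq(Xs, df["A"].to_numpy(), rcond=None)
resid = df["A"].to_numpy() - Xs @ beta
r2 = 1 - (resid ** 2).sum() / ((df["A"] - df["A"].mean()) ** 2).sum()
print(round(r2, 4))  # close to the 0.9989 that Stata and R report
# coefficients on the original scale would be beta[1:] / scale
```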