Unexpected standard errors with weighted least squares in Python Pandas
Question
In the code for the main OLS class in Python Pandas, I am looking for help to clarify what conventions are used for the standard error and t-stats reported when weighted OLS is performed.
Here's my example data set, with some imports to use Pandas and to use scikits.statsmodels WLS directly:
import pandas as pd
import numpy as np
from statsmodels.regression.linear_model import WLS

# Make some random data.
np.random.seed(42)
df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])

# Add an intercept term for direct use in WLS
df['intercept'] = 1

# Add a number (I picked 10) to stabilize the weight proportions a little.
df['weights'] = df.weights + 10

# Fit the regression models.
pd_wls = pd.ols(y=df.a, x=df.b, weights=df.weights)
sm_wls = WLS(df.a, df[['intercept','b']], weights=df.weights).fit()
I use %cpaste to execute this in IPython and then print the summaries of both regressions:
In [226]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import pandas as pd
:import numpy as np
:from statsmodels.regression.linear_model import WLS
:
:# Make some random data.
:np.random.seed(42)
:df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])
:
:# Add an intercept term for direct use in WLS
:df['intercept'] = 1
:
:# Add a number (I picked 10) to stabilize the weight proportions a little.
:df['weights'] = df.weights + 10
:
:# Fit the regression models.
:pd_wls = pd.ols(y=df.a, x=df.b, weights=df.weights)
:sm_wls = WLS(df.a, df[['intercept','b']], weights=df.weights).fit()
:--
In [227]: pd_wls
Out[227]:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 10
Number of Degrees of Freedom: 2
R-squared: 0.2685
Adj R-squared: 0.1770
Rmse: 2.4125
F-stat (1, 8): 2.9361, p-value: 0.1250
Degrees of Freedom: model 1, resid 8
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 0.5768 1.0191 0.57 0.5869 -1.4206 2.5742
intercept 0.5227 0.9079 0.58 0.5806 -1.2567 2.3021
---------------------------------End of Summary---------------------------------
In [228]: sm_wls.summary()
Out[228]:
<class 'statsmodels.iolib.summary.Summary'>
"""
WLS Regression Results
==============================================================================
Dep. Variable: a R-squared: 0.268
Model: WLS Adj. R-squared: 0.177
Method: Least Squares F-statistic: 2.936
Date: Wed, 17 Jul 2013 Prob (F-statistic): 0.125
Time: 15:14:04 Log-Likelihood: -10.560
No. Observations: 10 AIC: 25.12
Df Residuals: 8 BIC: 25.72
Df Model: 1
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
intercept 0.5227 0.295 1.770 0.115 -0.158 1.204
b 0.5768 0.333 1.730 0.122 -0.192 1.346
==============================================================================
Omnibus: 0.967 Durbin-Watson: 1.082
Prob(Omnibus): 0.617 Jarque-Bera (JB): 0.622
Skew: 0.003 Prob(JB): 0.733
Kurtosis: 1.778 Cond. No. 1.90
==============================================================================
"""
Notice the mismatching standard errors: Pandas claims the standard errors are [0.9079, 1.0191] while statsmodels says [0.295, 0.333].
Back in the code I linked at the top of the post I tried to track where the mismatch comes from.
First, you can see that the standard errors are reported by the function:
def _std_err_raw(self):
    """Returns the raw standard err values."""
    return np.sqrt(np.diag(self._var_beta_raw))
So, looking at self._var_beta_raw, I find:
def _var_beta_raw(self):
    """
    Returns the raw covariance of beta.
    """
    x = self._x.values
    y = self._y.values
    xx = np.dot(x.T, x)
    if self._nw_lags is None:
        return math.inv(xx) * (self._rmse_raw ** 2)
    else:
        resid = y - np.dot(x, self._beta_raw)
        m = (x.T * resid).T
        xeps = math.newey_west(m, self._nw_lags, self._nobs, self._df_raw,
                               self._nw_overlap)
        xx_inv = math.inv(xx)
        return np.dot(xx_inv, np.dot(xeps, xx_inv))
In my use case, self._nw_lags will always be None, so it's the first part that's puzzling. Since xx is just the standard product of the regressor matrix, x.T.dot(x), I'm wondering how the weights affect this. The term self._rmse_raw comes directly from the statsmodels regression being fitted in the constructor of OLS, so that most definitely incorporates the weights.
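To pin down the two conventions being compared, the textbook WLS covariance can be written next to what the quoted _var_beta_raw appears to compute. This is a sketch under two assumptions: that self._x holds the untransformed regressors, and that self._rmse_raw is the RMSE of the weighted fit. The symbols W (diagonal matrix of weights), n (observations) and p (estimated coefficients) are notation introduced here, not names from the Pandas source.

```latex
\begin{align*}
\hat\beta &= (X^\top W X)^{-1} X^\top W y,
  \qquad W = \operatorname{diag}(w_1, \dots, w_n) \\
\hat\sigma^2 &= \frac{1}{n - p} \sum_{i=1}^{n} w_i \,\bigl(y_i - x_i^\top \hat\beta\bigr)^2 \\
\widehat{\operatorname{Var}}_{\text{WLS}}(\hat\beta) &= \hat\sigma^2 \,(X^\top W X)^{-1} \\
\widehat{\operatorname{Var}}_{\text{quoted code}}(\hat\beta) &= \hat\sigma^2 \,(X^\top X)^{-1}
\end{align*}
```

Both expressions share the same weighted \hat\sigma^2 and the same \hat\beta; only the cross-product matrix differs, which would explain identical coefficients alongside different standard errors.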
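This prompts the following questions:
- Why is the standard error reported with the weighting applied in the RMSE part, but not on the regressor variables?
- Is this standard practice if you want the "untransformed" variables (wouldn't you then also want the untransformed RMSE?)? Is there a way to have Pandas give back the fully weighted version of the standard error?
- Why all the misdirection? In the constructor, the full statsmodels fitted regression is computed. Why wouldn't absolutely every summary statistic come straight from there? Why is it mixed and matched, so that some come from the statsmodels output and some from Pandas home-cooked calculations?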
It looks like I can reconcile the Pandas output by doing the following:
In [238]: xs = df[['intercept', 'b']]
In [239]: trans_xs = xs.values * np.sqrt(df.weights.values[:,None])
In [240]: trans_xs
Out[240]:
array([[ 3.26307961, -0.45116742],
[ 3.12503809, -0.73173821],
[ 3.08715494, 2.36918991],
[ 3.08776136, -1.43092325],
[ 2.87664425, -5.50382662],
[ 3.21158019, -3.25278836],
[ 3.38609639, -4.78219647],
[ 2.92835309, 0.19774643],
[ 2.97472796, 0.32996453],
[ 3.1158155 , -1.87147934]])
In [241]: np.sqrt(np.diag(np.linalg.inv(trans_xs.T.dot(trans_xs)) * (pd_wls._rmse_raw ** 2)))
Out[241]: array([ 0.29525952, 0.33344823])
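Because pandas.ols is no longer available in current pandas releases, the same reconciliation can be sketched with NumPy alone. The block below is an illustration, not the Pandas or statsmodels implementation: it copies the data set printed at the end of the question, fits WLS as OLS on variables scaled by the square root of the weights, and then forms the standard errors twice, once with the scaled regressors and once with the raw ones. All variable names are chosen here for illustration.

```python
import numpy as np

# Data copied from the data set printed at the end of the question.
a = np.array([0.496714, 1.523030, 1.579213, 0.542560, 0.241962,
              -0.562288, -0.908024, -0.225776, -0.544383, 0.375698])
b = np.array([-0.138264, -0.234153, 0.767435, -0.463418, -1.913280,
              -1.012831, -1.412304, 0.067528, 0.110923, -0.600639])
w = np.array([10.647689, 9.765863, 9.530526, 9.534270, 8.275082,
              10.314247, 11.465649, 8.575252, 8.849006, 9.708306])

X = np.column_stack([np.ones_like(b), b])   # columns: intercept, b
sqrt_w = np.sqrt(w)
Xw = X * sqrt_w[:, None]                    # regressors scaled by sqrt(weights)
yw = a * sqrt_w                             # response scaled the same way

# WLS coefficients are the OLS coefficients on the scaled data.
beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)

# Weighted RMSE: residuals of the scaled fit, with n - p degrees of freedom.
resid_w = yw - Xw @ beta
rmse = np.sqrt(resid_w @ resid_w / (len(a) - X.shape[1]))

# Same RMSE, two different choices for the cross-product matrix.
se_scaled = rmse * np.sqrt(np.diag(np.linalg.inv(Xw.T @ Xw)))
se_raw = rmse * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))

print("beta       ", beta)
print("se (Xw'Xw) ", se_scaled)  # compare with the statsmodels summary
print("se (X'X)   ", se_raw)     # compare with the pandas summary
```

With this data, se_scaled agrees with the statsmodels column and se_raw with the Pandas column to the printed precision, which is consistent with the Pandas code combining a weighted RMSE with unweighted regressors.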
I'm just very confused by this relationship. Is this what is common among statisticians: involving the weights with the RMSE part, but then choosing whether or not to weight the variables when calculating standard error of the coefficient? If that's the case, why wouldn't the coefficients themselves also be different between Pandas and statsmodels, since those are similarly derived from variables first transformed by statsmodels?
For reference, here is the full data set used in my toy example (in case np.random.seed isn't sufficient to make it reproducible):
In [242]: df
Out[242]:
a b weights intercept
0 0.496714 -0.138264 10.647689 1
1 1.523030 -0.234153 9.765863 1
2 1.579213 0.767435 9.530526 1
3 0.542560 -0.463418 9.534270 1
4 0.241962 -1.913280 8.275082 1
5 -0.562288 -1.012831 10.314247 1
6 -0.908024 -1.412304 11.465649 1
7 -0.225776 0.067528 8.575252 1
8 -0.544383 0.110923 8.849006 1
9 0.375698 -0.600639 9.708306 1
Recommended answer
Not directly answering your question here, but, in general, you should prefer the statsmodels code to pandas for modeling. There were some recently discovered problems with WLS in statsmodels that are now fixed. AFAIK, they were also fixed in pandas, but for the most part the pandas modeling code is not maintained and the medium term goal is to make sure everything available in pandas is deprecated and has been moved to statsmodels (next release 0.6.0 for statsmodels should do it).
To be a little clearer, pandas is now a dependency of statsmodels. You can pass DataFrames to statsmodels or use formulas in statsmodels. This is the intended relationship going forward.
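A minimal sketch of that workflow, assuming a statsmodels release that ships statsmodels.formula.api. The formula string and variable names are illustrative and not taken from the answer. It rebuilds the toy data from the question and fits the weighted regression directly from the DataFrame.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Rebuild the toy data from the question (same seed, same +10 shift).
np.random.seed(42)
df = pd.DataFrame(np.random.randn(10, 3), columns=["a", "b", "weights"])
df["weights"] = df["weights"] + 10

# The formula adds the intercept itself, so no explicit column of ones is needed.
fit = smf.wls("a ~ b", data=df, weights=df["weights"]).fit()

print(fit.params)  # coefficients, indexed by "Intercept" and "b"
print(fit.bse)     # standard errors of the coefficients
```

Reading both the coefficients and their standard errors from the same fitted results object avoids mixing statistics computed under different conventions.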