Unexpected standard errors with weighted least squares in Python Pandas


Problem description

In the code for the main OLS class in Python Pandas, I am looking for help to clarify what conventions are used for the standard error and t-stats reported when weighted OLS is performed.

Here's my example data set, with some imports to use Pandas and to use scikits.statsmodels WLS directly:

import pandas
import numpy as np
from statsmodels.regression.linear_model import WLS

# Make some random data.
np.random.seed(42)
df = pandas.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])

# Add an intercept term for direct use in WLS
df['intercept'] = 1 

# Add a number (I picked 10) to stabilize the weight proportions a little.
df['weights'] = df.weights + 10

# Fit the regression models.
pd_wls = pandas.ols(y=df.a, x=df.b, weights=df.weights)
sm_wls = WLS(df.a, df[['intercept','b']], weights=df.weights).fit()

I use %cpaste to execute this in IPython and then print the summaries of both regressions:

In [226]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import pandas
:import numpy as np
:from statsmodels.regression.linear_model import WLS
:
:# Make some random data.
:np.random.seed(42)
:df = pandas.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])
:
:# Add an intercept term for direct use in WLS
:df['intercept'] = 1
:
:# Add a number (I picked 10) to stabilize the weight proportions a little.
:df['weights'] = df.weights + 10
:
:# Fit the regression models.
:pd_wls = pandas.ols(y=df.a, x=df.b, weights=df.weights)
:sm_wls = WLS(df.a, df[['intercept','b']], weights=df.weights).fit()
:--

In [227]: pd_wls
Out[227]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         10
Number of Degrees of Freedom:   2

R-squared:         0.2685
Adj R-squared:     0.1770

Rmse:              2.4125

F-stat (1, 8):     2.9361, p-value:     0.1250

Degrees of Freedom: model 1, resid 8

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.5768     1.0191       0.57     0.5869    -1.4206     2.5742
     intercept     0.5227     0.9079       0.58     0.5806    -1.2567     2.3021
---------------------------------End of Summary---------------------------------


In [228]: sm_wls.summary()
Out[228]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            WLS Regression Results
==============================================================================
Dep. Variable:                      a   R-squared:                       0.268
Model:                            WLS   Adj. R-squared:                  0.177
Method:                 Least Squares   F-statistic:                     2.936
Date:                Wed, 17 Jul 2013   Prob (F-statistic):              0.125
Time:                        15:14:04   Log-Likelihood:                -10.560
No. Observations:                  10   AIC:                             25.12
Df Residuals:                       8   BIC:                             25.72
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
intercept      0.5227      0.295      1.770      0.115        -0.158     1.204
b              0.5768      0.333      1.730      0.122        -0.192     1.346
==============================================================================
Omnibus:                        0.967   Durbin-Watson:                   1.082
Prob(Omnibus):                  0.617   Jarque-Bera (JB):                0.622
Skew:                           0.003   Prob(JB):                        0.733
Kurtosis:                       1.778   Cond. No.                         1.90
==============================================================================
"""

Notice the mismatching standard errors: Pandas claims the standard errors are [0.9079, 1.0191] while statsmodels says [0.295, 0.333].

Back in the code I linked at the top of the post I tried to track where the mismatch comes from.

First, you can see that the standard errors are reported by this function:

def _std_err_raw(self):
    """Returns the raw standard err values."""
    return np.sqrt(np.diag(self._var_beta_raw))

So looking at self._var_beta_raw, I find:

def _var_beta_raw(self):
    """
    Returns the raw covariance of beta.
    """
    x = self._x.values
    y = self._y.values

    xx = np.dot(x.T, x)

    if self._nw_lags is None:
        return math.inv(xx) * (self._rmse_raw ** 2)
    else:
        resid = y - np.dot(x, self._beta_raw)
        m = (x.T * resid).T

        xeps = math.newey_west(m, self._nw_lags, self._nobs, self._df_raw,
                               self._nw_overlap)

        xx_inv = math.inv(xx)
        return np.dot(xx_inv, np.dot(xeps, xx_inv))

In my use case, self._nw_lags will always be None, so it's the first branch that's puzzling. Since xx is just the standard product of the regressor matrix, x.T.dot(x), I'm wondering how the weights affect it. The term self._rmse_raw comes directly from the statsmodels regression fitted in the OLS constructor, so it most definitely incorporates the weights.
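To make the discrepancy concrete, here is a numpy-only sketch that rebuilds the toy data above and computes both covariance conventions side by side. The variable names are mine, not pandas internals; the `scale` term plays the role of `self._rmse_raw ** 2`:

```python
import numpy as np

# Rebuild the toy data from the post.
np.random.seed(42)
raw = np.random.randn(10, 3)
y = raw[:, 0]                          # column 'a'
b = raw[:, 1]                          # column 'b'
w = raw[:, 2] + 10                     # column 'weights'
X = np.column_stack([np.ones(10), b])  # intercept + regressor

# WLS coefficients from the weighted normal equations: (X'WX) beta = X'W y.
W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Weighted residual variance (statsmodels' scale, i.e. rmse**2 above).
resid = y - X @ beta
scale = (w * resid**2).sum() / (len(y) - 2)

# statsmodels convention: weights enter both the scale AND X'X.
se_statsmodels = np.sqrt(np.diag(np.linalg.inv(X.T @ W @ X)) * scale)

# pandas convention: weighted scale, but the plain, unweighted X'X.
se_pandas = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * scale)

print(se_statsmodels)  # ~ [0.295, 0.333], as in the statsmodels summary
print(se_pandas)       # ~ [0.908, 1.019], as in the pandas summary
```

The only difference between the two lines is whether W appears inside the inverted cross-product; the coefficient estimate `beta` and the scale are identical in both.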

This prompts the following questions:

  1. Why are the standard errors reported with the weights applied only to the RMSE portion, and not to the regressors?
  2. Is this standard practice if you want the "untransformed" variables (in which case wouldn't you also want the untransformed RMSE?), and is there a way to make Pandas return the fully weighted version of the standard errors?
  3. Why the misdirection? In the constructor, a full statsmodels fitted regression is computed. Why doesn't absolutely every summary statistic come directly from there, instead of this mix-and-match where some come from the statsmodels output and some from Pandas' home-grown calculations?

It looks like I can reconcile the Pandas output by doing the following:

In [238]: xs = df[['intercept', 'b']]

In [239]: trans_xs = xs.values * np.sqrt(df.weights.values[:,None])

In [240]: trans_xs
Out[240]:
array([[ 3.26307961, -0.45116742],
       [ 3.12503809, -0.73173821],
       [ 3.08715494,  2.36918991],
       [ 3.08776136, -1.43092325],
       [ 2.87664425, -5.50382662],
       [ 3.21158019, -3.25278836],
       [ 3.38609639, -4.78219647],
       [ 2.92835309,  0.19774643],
       [ 2.97472796,  0.32996453],
       [ 3.1158155 , -1.87147934]])

In [241]: np.sqrt(np.diag(np.linalg.inv(trans_xs.T.dot(trans_xs)) * (pd_wls._rmse_raw ** 2)))
Out[241]: array([ 0.29525952,  0.33344823])

I'm just very confused by this relationship. Is this what is common among statisticians: involving the weights with the RMSE part, but then choosing whether or not to weight the variables when calculating standard error of the coefficient? If that's the case, why wouldn't the coefficients themselves also be different between Pandas and statsmodels, since those are similarly derived from variables first transformed by statsmodels?
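As a sanity check on that last point, the coefficients agree because both libraries estimate them from the sqrt(w)-transformed variables; only the covariance step diverges. A small numpy sketch, rebuilding the same toy data:

```python
import numpy as np

# Rebuild the toy data.
np.random.seed(42)
raw = np.random.randn(10, 3)
y, b, w = raw[:, 0], raw[:, 1], raw[:, 2] + 10
X = np.column_stack([np.ones(10), b])

# WLS via the weighted normal equations: (X'WX) beta = X'W y.
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Plain OLS on the sqrt(w)-transformed variables gives the same beta.
Xt = X * np.sqrt(w)[:, None]
yt = y * np.sqrt(w)
beta_ols = np.linalg.lstsq(Xt, yt, rcond=None)[0]

print(np.allclose(beta_wls, beta_ols))  # True
print(beta_wls)  # ~ [0.5227, 0.5768], the coefficients both summaries report
```

So any route through the transformed variables lands on the same point estimates; it is only the choice of x.T.dot(x) versus trans_xs.T.dot(trans_xs) in the covariance that splits the two libraries.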

For reference, here was the full data set used in my toy example (in case np.random.seed isn't sufficient to make it reproducible):

In [242]: df
Out[242]:
          a         b    weights  intercept
0  0.496714 -0.138264  10.647689          1
1  1.523030 -0.234153   9.765863          1
2  1.579213  0.767435   9.530526          1
3  0.542560 -0.463418   9.534270          1
4  0.241962 -1.913280   8.275082          1
5 -0.562288 -1.012831  10.314247          1
6 -0.908024 -1.412304  11.465649          1
7 -0.225776  0.067528   8.575252          1
8 -0.544383  0.110923   8.849006          1
9  0.375698 -0.600639   9.708306          1

Recommended answer

Not directly answering your question here, but, in general, you should prefer the statsmodels code to pandas for modeling. There were some recently discovered problems with WLS in statsmodels that are now fixed. AFAIK, they were also fixed in pandas, but for the most part the pandas modeling code is not maintained and the medium term goal is to make sure everything available in pandas is deprecated and has been moved to statsmodels (next release 0.6.0 for statsmodels should do it).

To be a little clearer, pandas is now a dependency of statsmodels. You can pass DataFrames to statsmodels or use formulas in statsmodels. This is the intended relationship going forward.
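For example, a sketch using the statsmodels formula interface (assuming a reasonably recent statsmodels; `pandas.ols` was later deprecated and removed entirely), which reproduces the WLS fit above directly from the DataFrame:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Same toy data as in the question.
np.random.seed(42)
df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])
df['weights'] = df['weights'] + 10

# The formula adds the intercept automatically; no manual column needed.
fit = smf.wls('a ~ b', data=df, weights=df['weights']).fit()

print(fit.params)  # Intercept ~ 0.5227, b ~ 0.5768
print(fit.bse)     # ~ [0.295, 0.333], the statsmodels standard errors
```

This gives the fully weighted standard errors in one call, with no hand-built intercept column and no reconciliation step.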
