statsmodels in Python: how exactly are duplicated features handled?


Question


I am a heavy R user and am currently learning Python. I have a question about how statsmodels.api handles duplicated features. As I understand it, this function is a Python counterpart of glm in R, so I expected it to return the maximum likelihood estimates (MLE).

My question is: which algorithm does statsmodels employ to obtain the MLE? In particular, how does the algorithm handle duplicated features?

To clarify my question, I generate a sample of size 50 from a Bernoulli distribution with a single covariate x1.

import statsmodels.api as sm
import pandas as pd
import numpy as np

def ilogit(eta):
    # inverse logit: exp(eta) / (1 + exp(eta))
    return 1.0 - 1.0/(np.exp(eta) + 1)

## generate samples
Nsample = 50
cov = {}
cov["x1"] = np.random.normal(0, 1, Nsample)
cov = pd.DataFrame(cov)
true_value = 0.5
resp = {}
resp["FAIL"] = np.random.binomial(1, ilogit(true_value * cov["x1"]))
resp = pd.DataFrame(resp)
resp["NOFAIL"] = 1 - resp["FAIL"]  # (successes, failures) pair expected by Binomial

Then I fit the logistic regression:

## fit logistic regression
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()

This returns a summary in which the estimated coefficient is more or less similar to the true value (0.5). Then I create a duplicate column, x2, and fit the logistic regression model again. (glm in R would return NA for x2.)

cov["x2"] = cov["x1"]
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()

This outputs a summary with a surprise: the fit works, and the coefficient estimates of x1 and x2 are exactly identical (0.1182 each). Since the previous fit returned a coefficient estimate of 0.2364 for x1, the estimate has been halved. I then increase the number of duplicated columns to 9 in total and fit the model:

## add duplicated columns x3, ..., x9
for icol in range(3, 10):
    cov["x" + str(icol)] = cov["x1"]
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()

As expected, the estimates of the duplicated variables are all identical (0.0263), roughly one ninth of the original estimate for x1 (0.2364), matching the nine copies of the column.

I am surprised by this unexpected behaviour of the maximum likelihood estimates. Could you explain why this happens, and what kind of algorithm statsmodels.api employs behind the scenes?

Solution

The short answer:

In this case GLM is using the Moore-Penrose generalized inverse, pinv, which corresponds to a principal component regression in which components with zero eigenvalues are dropped. A "zero" eigenvalue is defined by the default threshold rcond in numpy.linalg.pinv.
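
The splitting seen above follows from the minimum-norm property of the pseudoinverse: every coefficient vector whose duplicated entries sum to the single-column estimate produces the same fitted values, and the equal split has the smallest norm. A minimal numpy sketch (an illustration with synthetic data, not statsmodels' code path) makes this concrete:

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 50)
y = 0.5 * x1 + rng.normal(0, 0.1, 50)

# single column: ordinary least squares via the pseudoinverse
b1 = np.linalg.pinv(x1[:, None]) @ y

# duplicated column: pinv returns the minimum-norm solution,
# which splits the coefficient equally between the identical columns
X = np.column_stack([x1, x1])
b2 = np.linalg.pinv(X) @ y

print(b1)   # approx [0.5]
print(b2)   # approx [0.25, 0.25]; b2.sum() approx equals b1[0]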

statsmodels does not have a systematic policy towards collinearity. Some nonlinear optimization routines raise an exception when the matrix inverse fails. However, the linear regression models OLS and WLS use the generalized inverse by default, and in that case we see the behavior above.
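
The same behavior can be checked directly on the linear models; for example (a small sketch with synthetic data):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.normal(0, 1, 50)
y = 0.5 * x1 + rng.normal(0, 0.1, 50)

X = np.column_stack([x1, x1])   # singular design matrix
res = sm.OLS(y, X).fit()        # the default solver is pinv-based
print(res.params)               # both entries approx 0.25, not an error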

The default optimization algorithm in GLM.fit is iteratively reweighted least squares, irls, which uses WLS and therefore inherits WLS's default behavior for singular design matrices. The version in statsmodels master also has the option of using the standard scipy optimizers, where the behavior with respect to singular or near-singular design matrices depends on the details of the optimization algorithm.
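
To see how the WLS behavior propagates, here is a minimal IRLS sketch for logistic regression (an illustration of the general algorithm, not statsmodels' actual implementation): because each weighted least squares step is solved with pinv, a duplicated column never raises an error, and every iteration keeps the equal split among the duplicates.

import numpy as np

def irls_logit(X, y, n_iter=25):
    # Minimal IRLS sketch for logistic regression with a pinv-based WLS step.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))   # inverse logit
        w = mu * (1.0 - mu)               # IRLS weights
        z = eta + (y - mu) / w            # working response
        Xw = X * w[:, None]
        # WLS step solved with the Moore-Penrose inverse of X'WX:
        # a singular design yields the minimum-norm solution, not an error
        beta = np.linalg.pinv(X.T @ Xw) @ (Xw.T @ z)
    return beta

rng = np.random.default_rng(2)
x1 = rng.normal(0, 1, 50)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.5 * x1)))

print(irls_logit(x1[:, None], y))                # approx [0.5]
print(irls_logit(np.column_stack([x1, x1]), y))  # each entry approx half of that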
