MNLogit in statsmodels returning nan
Question
I'm trying to use statsmodels' MNLogit function on the famous iris data set. I get: "Current function value: nan" when I try to fit a model. Here is the code I am using:
import statsmodels.api as st
iris = st.datasets.get_rdataset('iris','datasets')
y = iris.data.Species
x = iris.data.iloc[:, 0:4]
x = st.add_constant(x, prepend = False)
mdl = st.MNLogit(y, x)
mdl_fit = mdl.fit()
print (mdl_fit.summary())
Answer
In the iris example we can perfectly predict Setosa. This causes problems with (partial) perfect separation in Logit and MNLogit.
Perfect separation is good for prediction, but the parameters of the logit go to infinity. In this case I get a Singular Matrix error instead of NaNs with a relatively recent version of statsmodels master (on Windows).
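To see why the parameters diverge, here is a small pure-Python sketch (not statsmodels code; the data and function names are made up for the illustration): a one-parameter logistic model fit by gradient ascent on perfectly separated data. No matter how many iterations we allow, the coefficient keeps growing instead of settling on a finite maximum-likelihood estimate.

```python
import math

def fit_logit(x, y, steps, lr=0.1):
    """One-parameter logistic regression (no intercept) via gradient ascent."""
    b = 0.0
    for _ in range(steps):
        # gradient of the log-likelihood: sum_i (y_i - p_i) * x_i
        grad = sum((yi - 1.0 / (1.0 + math.exp(-b * xi))) * xi
                   for xi, yi in zip(x, y))
        b += lr * grad
    return b

# Perfectly separated data: every x < 0 has y = 0, every x > 0 has y = 1.
x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]

# The estimate never converges: it grows with every additional iteration,
# because the likelihood keeps improving as b -> infinity.
for steps in (100, 1000, 10000):
    print(steps, fit_logit(x, y, steps))
```

The gradient stays strictly positive on separated data, so more iterations always push the coefficient further out; in a real optimizer this shows up as non-convergence, a singular Hessian, or NaNs.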
The default optimizer for the discrete models is Newton, which fails when the Hessian becomes singular. Other optimizers that don't use information from the Hessian are able to finish the optimization. For example, using 'bfgs', I get
>>> mdl_fit = mdl.fit(method='bfgs')
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.057112
Iterations: 35
Function evaluations: 37
Gradient evaluations: 37
e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\base\model.py:471: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
"Check mle_retvals", ConvergenceWarning)
The predicted probabilities for Setosa are essentially (1, 0, 0), that is, they are perfectly predicted:
>>> fitted = mdl_fit.predict()
>>> fitted[y=='setosa'].min(0)
array([ 9.99497636e-01, 2.07389867e-11, 1.71740822e-38])
>>> fitted[y=='setosa'].max(0)
array([ 1.00000000e+00, 5.02363854e-04, 1.05778255e-20])
However, because of the perfect separation the parameters are not identified: the values are determined mostly by the stopping criterion of the optimizer, and the standard errors are very large.
>>> print(mdl_fit.summary())
MNLogit Regression Results
==============================================================================
Dep. Variable: Species No. Observations: 150
Model: MNLogit Df Residuals: 140
Method: MLE Df Model: 8
Date: Mon, 20 Jul 2015 Pseudo R-squ.: 0.9480
Time: 04:08:04 Log-Likelihood: -8.5668
converged: False LL-Null: -164.79
LLR p-value: 9.200e-63
=====================================================================================
Species=versicolor coef std err z P>|z| [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Sepal.Length -1.4959 444.817 -0.003 0.997 -873.321 870.330
Sepal.Width -8.0560 282.766 -0.028 0.977 -562.267 546.155
Petal.Length 11.9301 374.116 0.032 0.975 -721.323 745.184
Petal.Width 1.7039 759.366 0.002 0.998 -1486.627 1490.035
const 1.6444 1550.515 0.001 0.999 -3037.309 3040.597
--------------------------------------------------------------------------------------
Species=virginica coef std err z P>|z| [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Sepal.Length -8.0348 444.835 -0.018 0.986 -879.896 863.827
Sepal.Width -15.8195 282.793 -0.056 0.955 -570.083 538.444
Petal.Length 22.1797 374.155 0.059 0.953 -711.152 755.511
Petal.Width 14.0603 759.384 0.019 0.985 -1474.304 1502.425
const -6.5053 1550.533 -0.004 0.997 -3045.494 3032.483
=====================================================================================
About the implementation in statsmodels
Logit checks specifically for perfect separation and raises an exception that can optionally be weakened to a warning. For other models like MNLogit, there is not yet an explicit check for perfect separation, largely because of the lack of good test cases and easily identifiable general conditions. (Several issues, like https://github.com/statsmodels/statsmodels/issues/516, are still open.)
My general strategy:

When there is a convergence failure, try different optimizers and different starting values ('start_params'). If some optimizers succeed, then it might be a difficult optimization problem, caused for example by the curvature of the objective function, badly scaled explanatory variables, or similar. A useful check is to use the parameter estimates from robust optimizers like 'nm' or 'powell' as starting values for stricter optimizers like 'newton' or 'bfgs'.
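As a toy illustration of that two-stage pattern (plain Python, not the statsmodels API; the data and helper names are invented for this sketch): a coarse grid search stands in for a robust optimizer like 'nm', and its result seeds Newton iterations, the "strict" stage. The data here are deliberately not perfectly separated, so a finite optimum exists.

```python
import math

def nll(b, x, y):
    """Negative log-likelihood of a one-parameter logistic model."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-b * xi))
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def coarse_search(x, y, grid):
    """'Robust' stage: best value on a coarse grid (stands in for nm/powell)."""
    return min(grid, key=lambda b: nll(b, x, y))

def newton(b, x, y, steps=20):
    """'Strict' stage: Newton iterations using first and second derivatives."""
    for _ in range(steps):
        g = h = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-b * xi))
            g += (p - yi) * xi          # gradient of the NLL
            h += p * (1 - p) * xi * xi  # Hessian of the NLL (always positive here)
        b -= g / h
    return b

# Not perfectly separated: labels overlap across the sign of x.
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [0, 1, 0, 1, 0, 1]

b0 = coarse_search(x, y, [i / 10 for i in range(-50, 51)])  # robust start value
b_hat = newton(b0, x, y)                                    # refined estimate
print(b0, b_hat)
```

In statsmodels itself the same idea is one `fit` call with a robust `method` followed by a second `fit` passing the first result via `start_params`.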
If the results are still not good after some optimizers converge, then it might be an inherent problem with the data, like perfect separation in Logit, Probit, and several other models, or a singular or near-singular design matrix. In that case the model has to be changed. Recommendations for handling perfect separation can be found with an internet search.