MNLogit in statsmodels returning nan


Question

I'm trying to use statsmodels' MNLogit function on the famous iris data set. I get: "Current function value: nan" when I try to fit a model. Here is the code I am using:

import statsmodels.api as st
iris = st.datasets.get_rdataset('iris','datasets')
y = iris.data.Species
x = iris.data.iloc[:, 0:4]  # .ix has been removed from pandas; use .iloc
x = st.add_constant(x, prepend = False)
mdl = st.MNLogit(y, x)
mdl_fit = mdl.fit()
print (mdl_fit.summary())

Answer

In the iris example we can perfectly predict Setosa. This causes problems with (partial) perfect separation in Logit and MNLogit.

Perfect separation is good for prediction, but the parameters of logit go to infinity. In this case I get a Singular Matrix error instead of Nans with a relatively recent version of statsmodels master (on Windows).
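Why the parameters diverge can be illustrated without statsmodels at all: on a perfectly separated toy dataset (hypothetical, not the iris data), the logit log-likelihood keeps increasing as the slope grows, so no finite maximum-likelihood estimate exists. A minimal sketch:

```python
import math

# Hypothetical perfectly separated 1-D data: every x < 0 has y = 0,
# every x > 0 has y = 1.
x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]

def loglike(beta):
    """Log-likelihood of a logit model with slope beta and no intercept."""
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-beta * xi))
        ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return ll

# The likelihood keeps improving as beta grows, so Newton's method
# chases a maximum at infinity and the Hessian degenerates.
print([round(loglike(b), 4) for b in (1.0, 5.0, 10.0)])
```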

The default optimizer for the discrete models is Newton which fails when the Hessian becomes singular. Other optimizers that don't use the information from the Hessian are able to finish the optimization. For example using 'bfgs', I get

>>> mdl_fit = mdl.fit(method='bfgs')
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.057112
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\base\model.py:471: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)

The predicted probabilities for Setosa are essentially (1, 0, 0); that is, they are perfectly predicted:

>>> fitted = mdl_fit.predict()
>>> fitted[y=='setosa'].min(0)
array([  9.99497636e-01,   2.07389867e-11,   1.71740822e-38])
>>> fitted[y=='setosa'].max(0)
array([  1.00000000e+00,   5.02363854e-04,   1.05778255e-20])

However, because of the perfect separation the parameters are not identified: the estimates are determined mostly by the stopping criterion of the optimizer, and the standard errors are very large.

>>> print(mdl_fit.summary())
                          MNLogit Regression Results                          
==============================================================================
Dep. Variable:                Species   No. Observations:                  150
Model:                        MNLogit   Df Residuals:                      140
Method:                           MLE   Df Model:                            8
Date:                Mon, 20 Jul 2015   Pseudo R-squ.:                  0.9480
Time:                        04:08:04   Log-Likelihood:                -8.5668
converged:                      False   LL-Null:                       -164.79
                                        LLR p-value:                 9.200e-63
=====================================================================================
Species=versicolor       coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Sepal.Length          -1.4959    444.817     -0.003      0.997      -873.321   870.330
Sepal.Width           -8.0560    282.766     -0.028      0.977      -562.267   546.155
Petal.Length          11.9301    374.116      0.032      0.975      -721.323   745.184
Petal.Width            1.7039    759.366      0.002      0.998     -1486.627  1490.035
const                  1.6444   1550.515      0.001      0.999     -3037.309  3040.597
--------------------------------------------------------------------------------------
Species=virginica       coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Sepal.Length         -8.0348    444.835     -0.018      0.986      -879.896   863.827
Sepal.Width         -15.8195    282.793     -0.056      0.955      -570.083   538.444
Petal.Length         22.1797    374.155      0.059      0.953      -711.152   755.511
Petal.Width          14.0603    759.384      0.019      0.985     -1474.304  1502.425
const                -6.5053   1550.533     -0.004      0.997     -3045.494  3032.483
=====================================================================================

About the implementation in statsmodels:
Logit checks specifically for perfect separation and raises an Exception that can optionally be weakened to a Warning. For other models like MNLogit, there is not yet an explicit check for perfect separation, largely for the lack of good test cases and easily identifiable general conditions. (several issues like https://github.com/statsmodels/statsmodels/issues/516 are still open)

My general strategy:

When there is a convergence failure, then try different optimizers and different starting values (start_params). If some optimizers succeed, then it might be a difficult optimization problem, either with the curvature of the objective function, badly scaled explanatory variables or similar. A useful check is to use the parameter estimates of robust optimizers like nm or powell as starting values for the optimizers that are more strict, like newton or bfgs.

If the results are still not good after some optimizers converge, then it might be an inherent problem with the data, such as perfect separation (in Logit, Probit, and several other models) or a singular or near-singular design matrix. In that case the model has to be changed. Recommendations for handling perfect separation can be found with an internet search.
