Coefficients for Logistic Regression scikit-learn vs statsmodels

Problem description

When performing a logistic regression using the two APIs, they give different coefficients. Even with this simple example they don't produce the same results in terms of coefficients. And I followed advice from older posts on the same topic, like setting a large value for the parameter C in sklearn, since it makes the penalization almost vanish (or setting penalty="none").

import pandas as pd
import numpy as np
import sklearn as sk
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm

n = 200

x = np.random.randint(0, 2, size=n)
y = (x > (0.5 + np.random.normal(0, 0.5, n))).astype(int)

display(pd.crosstab( y, x ))


max_iter = 100

#### Statsmodels
res_sm = sm.Logit(y, x).fit(method="ncg", maxiter=max_iter)
print(res_sm.params)

#### Scikit-Learn
res_sk = LogisticRegression( solver='newton-cg', multi_class='multinomial', max_iter=max_iter, fit_intercept=True, C=1e8 )
res_sk.fit( x.reshape(n, 1), y )
print(res_sk.coef_)

For example, I just ran the above code and got 1.72276655 for statsmodels and 1.86324749 for sklearn. And when run multiple times it always gives different coefficients (sometimes closer than others, but anyway).

Thus, even with that toy example the two APIs give different coefficients (and so different odds ratios), and with real data (not shown here) it almost gets "out of control"...

Am I missing something? How can I produce similar coefficients, for example matching at least to one or two decimal places?

Answer

There are some issues with your code.

To start with, the two models you show here are not equivalent: although you fit your scikit-learn LogisticRegression with fit_intercept=True (which is the default setting), you don't do so with your statsmodels one; from the statsmodels docs:

An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.

It seems that this is a frequent point of confusion - see for example scikit-learn & statsmodels - which R-squared is correct? (and my own answer there as well).
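
To make the intercept point concrete, here is a minimal sketch (not from the original post) of what sm.add_constant does: it simply prepends a column of ones to the design matrix, which statsmodels then estimates as the intercept term.

import numpy as np
import statsmodels.api as sm

x_demo = np.array([0, 1, 1, 0])   # hypothetical toy predictor
print(sm.add_constant(x_demo))
# [[1. 0.]
#  [1. 1.]
#  [1. 1.]
#  [1. 0.]]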

The other issue is that, although you are in a binary classification setting, you ask for multi_class='multinomial' in your LogisticRegression, which should not be the case.
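
In other words, for a two-class target you can simply drop that argument and let scikit-learn fit an ordinary binomial logit; a one-line sketch (the answer's full corrected call appears further below):

# binary target, so no multi_class='multinomial'; other settings as in the question
res_sk = LogisticRegression(solver='newton-cg', max_iter=max_iter, fit_intercept=True, C=1e8)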

The third issue is that, as explained in the relevant Cross Validated thread Logistic Regression: Scikit Learn vs Statsmodels:

There is no way to switch off regularization in scikit-learn, but you can make it ineffective by setting the tuning parameter C to a large number.

which makes the two models again non-comparable in principle, but you have successfully addressed it here by setting C=1e8. In fact, since then (2016), scikit-learn has indeed added a way to switch regularization off, by setting penalty='none' since, according to the docs:

If ‘none’ (not supported by the liblinear solver), no regularization is applied.

which should now be considered the canonical way to switch off the regularization.
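
As a small sketch (not the answer's exact code), the two ways of neutralising the L2 penalty mentioned above look like this; note that in newer scikit-learn releases (roughly 1.2 onward) the string 'none' has been deprecated in favour of penalty=None:

clf_large_C = LogisticRegression(solver='newton-cg', C=1e8)              # penalty made negligible
clf_no_penalty = LogisticRegression(solver='newton-cg', penalty='none')  # penalty switched off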

So, incorporating these changes in your code, we have:

np.random.seed(42) # for reproducibility

#### Statsmodels
# first artificially add intercept to x, as advised in the docs:
x_ = sm.add_constant(x)
res_sm = sm.Logit(y, x_).fit(method="ncg", maxiter=max_iter) # x_ here
print(res_sm.params)

which gives the result:

Optimization terminated successfully.
         Current function value: 0.403297
         Iterations: 5
         Function evaluations: 6
         Gradient evaluations: 10
         Hessian evaluations: 5
[-1.65822763  3.65065752]

with the first element of the array being the intercept and the second the coefficient of x. While for scikit-learn we have:

#### Scikit-Learn

res_sk = LogisticRegression(solver='newton-cg', max_iter=max_iter, fit_intercept=True, penalty='none')
res_sk.fit( x.reshape(n, 1), y )
print(res_sk.intercept_, res_sk.coef_)

with the result:

[-1.65822806] [[3.65065707]]

These results are practically identical, within the machine's numeric precision.
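
If you want to check this programmatically, a quick sketch (reusing res_sm and res_sk from the code above) is to stack the scikit-learn intercept and coefficient in the same order as the statsmodels parameter vector and compare:

sm_params = res_sm.params                                   # [intercept, coefficient of x]
sk_params = np.r_[res_sk.intercept_, res_sk.coef_.ravel()]  # same ordering

print(np.abs(sm_params - sk_params))                  # differences around 1e-6 or smaller
print(np.allclose(sm_params, sk_params, atol=1e-5))   # True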

Repeating the procedure for different values of np.random.seed() does not change the essence of the results shown above.
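
A short sketch of such a repetition (reusing n, max_iter and the data-generating code from the question); the two fits should again agree to several decimal places for each seed:

for seed in (0, 1, 2, 42):
    np.random.seed(seed)
    x = np.random.randint(0, 2, size=n)
    y = (x > (0.5 + np.random.normal(0, 0.5, n))).astype(int)

    fit_sm = sm.Logit(y, sm.add_constant(x)).fit(method="ncg", maxiter=max_iter, disp=0)
    fit_sk = LogisticRegression(solver='newton-cg', max_iter=max_iter,
                                fit_intercept=True, penalty='none').fit(x.reshape(n, 1), y)

    print(seed, fit_sm.params, fit_sk.intercept_, fit_sk.coef_.ravel())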
