Logistic regression results different in Scikit python and R?


Problem description

I was running logistic regression on the iris dataset in both R and Python, but the two give different results (coefficients, intercept, and scores).

#Python code.
    #imports used below..
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    In[23]: iris_df.head(5)
    Out[23]: 
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
    0           5.1          3.5           1.4          0.2        0
    1           4.9          3.0           1.4          0.2        0
    2           4.7          3.2           1.3          0.2        0
    3           4.6          3.1           1.5          0.2        0
    In[35]: iris_df.shape
    Out[35]: (100, 5)
    #looking at the levels of the Species dependent variable..

        In[25]: iris_df['Species'].unique()
        Out[25]: array([0, 1], dtype=int64)

    #creating dependent and independent variable datasets..

        #.ix is deprecated; use position-based .iloc instead..
        x = iris_df.iloc[:, 0:4]
        y = iris_df.iloc[:, -1]

    #modelling starts..
    y = np.ravel(y)
    logistic = LogisticRegression()
    model = logistic.fit(x,y)
    #getting the model coefficients..
    model_coef= pd.DataFrame(list(zip(x.columns, np.transpose(model.coef_))))
    model_intercept = model.intercept_
    In[30]: model_coef
    Out[36]: 
                  0                  1
    0  Sepal.Length  [-0.402473917528]
    1   Sepal.Width   [-1.46382924771]
    2  Petal.Length    [2.23785647964]
    3   Petal.Width     [1.0000929404]
    In[31]: model_intercept
    Out[31]: array([-0.25906453])
    #scores...
    In[34]: logistic.predict_proba(x)
    Out[34]: 
    array([[ 0.9837306 ,  0.0162694 ],
           [ 0.96407227,  0.03592773],
           [ 0.97647105,  0.02352895],
           [ 0.95654126,  0.04345874],
           [ 0.98534488,  0.01465512],
           [ 0.98086592,  0.01913408],
           ...

R code.

> str(irisdf)
'data.frame':   100 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : int  0 0 0 0 0 0 0 0 0 0 ...

> model <- glm(Species ~ ., data = irisdf, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
> summary(model)

Call:
glm(formula = Species ~ ., family = binomial, data = irisdf)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-1.681e-05  -2.110e-08   0.000e+00   2.110e-08   2.006e-05  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)       6.556 601950.324       0        1
Sepal.Length     -9.879 194223.245       0        1
Sepal.Width      -7.418  92924.451       0        1
Petal.Length     19.054 144515.981       0        1
Petal.Width      25.033 216058.936       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3863e+02  on 99  degrees of freedom
Residual deviance: 1.3166e-09  on 95  degrees of freedom
AIC: 10

Number of Fisher Scoring iterations: 25

Because of the convergence problem, I increased the maximum number of iterations and loosened the convergence tolerance (epsilon = 0.01).

> model <- glm(Species ~ ., data = irisdf, family = binomial,control = glm.control(epsilon=0.01,trace=FALSE,maxit = 100))
> summary(model)

Call:
glm(formula = Species ~ ., family = binomial, data = irisdf, 
    control = glm.control(epsilon = 0.01, trace = FALSE, maxit = 100))

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-0.0102793  -0.0005659  -0.0000052   0.0001438   0.0112531  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)     1.796    704.352   0.003    0.998
Sepal.Length   -3.426    215.912  -0.016    0.987
Sepal.Width    -4.208    123.513  -0.034    0.973
Petal.Length    7.615    159.478   0.048    0.962
Petal.Width    11.835    285.938   0.041    0.967

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3863e+02  on 99  degrees of freedom
Residual deviance: 5.3910e-04  on 95  degrees of freedom
AIC: 10.001

Number of Fisher Scoring iterations: 12

#R scores..
> scores = predict(model, newdata = irisdf, type = "response")
> head(scores,5)
           1            2            3            4            5 
2.844996e-08 4.627411e-07 1.848093e-07 1.818231e-06 2.631029e-08 

The scores, intercept, and coefficients are completely different in R and Python. Which one is correct? I want to proceed in Python, but I am now confused about which results are accurate.

Answer

UPDATED: The problem is that there is perfect separation along the petal width variable. In other words, this variable can be used to perfectly predict whether a sample in the given dataset is setosa or versicolor. That breaks the log-likelihood maximization used to estimate logistic regression in R: the log-likelihood can be driven arbitrarily high by taking the coefficient of petal width toward infinity, so the estimates never converge.
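The separation is easy to see directly. Below is a minimal sketch; it assumes the 100-row iris_df from the question, with setosa coded 0 and versicolor coded 1, and the quoted ranges are those of the standard iris data:

# Per-class min/max of Petal.Width: the two ranges do not overlap,
# so a single threshold separates the classes perfectly.
print(iris_df.groupby('Species')['Petal.Width'].agg(['min', 'max']))
#          min  max
# Species
# 0        0.1  0.6   (setosa)
# 1        1.0  1.8   (versicolor)

# Any cutoff between 0.6 and 1.0 labels every row correctly:
preds = (iris_df['Petal.Width'] > 0.8).astype(int)
print((preds == iris_df['Species']).mean())  # 1.0 -> perfect separation

Because such a perfect classifier exists, an unpenalized logistic fit can always improve its likelihood by scaling the petal-width coefficient further, so the maximum likelihood estimate does not exist.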


There is a good thread on CrossValidated discussing strategies for dealing with perfect separation.

So why does sklearn's LogisticRegression work? Because it employs regularized logistic regression: the penalty (L2 by default) discourages large parameter estimates, so the optimization converges to finite coefficients even under perfect separation.
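You can watch the regularization at work by weakening it. A minimal sketch (C is scikit-learn's inverse regularization strength, and x, y are as built in the question; a very large C approximates the unpenalized fit that diverges in R):

from sklearn.linear_model import LogisticRegression

# Default C=1.0: a moderate L2 penalty keeps the coefficients finite,
# which is why the Python fit in the question converged.
reg = LogisticRegression().fit(x, y)
print(reg.coef_, reg.intercept_)

# Nearly unregularized: under perfect separation the coefficients grow
# very large, mimicking R's non-converging glm estimates.
unreg = LogisticRegression(C=1e10, max_iter=10000).fit(x, y)
print(unreg.coef_, unreg.intercept_)

In other words, R's glm and sklearn's default LogisticRegression are optimizing different objectives here, which is why the coefficients cannot match.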

In the example below, I use Firth's bias-reduced logistic regression from the logistf package to produce a converged model.

library(logistf)

# Read the two-class data and make Species a factor.
irisdf <- read.table("path_to_iris.txt", sep = "\t", header = TRUE)
irisdf$Species <- as.factor(irisdf$Species)
sapply(irisdf, class)

model1 <- glm(Species ~ ., data = irisdf, family = binomial)
# Does not converge; throws the perfect-separation warnings.

model2 <- logistf(Species ~ ., data = irisdf)
# Does converge: Firth's penalized likelihood handles the separation.

ORIGINAL: Based on the standard errors and z-values in the R output, I think you have a bad model specification. A z-value close to 0 essentially tells you that the coefficient shows no detectable relationship with the dependent variable, so as reported this is a nonsensical model.

My first thought was that you need to convert the Species field into a categorical variable; in your example it is an int type. Try using as.factor:

How to convert integer into categorical data in R?

