Why are the logistic regression results different between statsmodels and R?

Problem description

I am trying to compare the logistic regression implementations in Python's statsmodels and R.

Python version:

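import statsmodels.api as sm
import pandas as pd
import pylab as pl
import numpy as np
df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
# Rename the fourth column to "prestige"
df.columns = list(df.columns)[:3] + ["prestige"]
# df.hist()
# pl.show()
# One dummy column per prestige level; float dtype keeps the design matrix numeric
dummy_ranks = pd.get_dummies(df["prestige"], prefix="prestige", dtype=float)
cols_to_keep = ["admit", "gre", "gpa"]
# Drop prestige_1 as the baseline level (.loc replaces the removed .ix indexer)
data = df[cols_to_keep].join(dummy_ranks.loc[:, "prestige_2":])
# statsmodels does not add an intercept on its own
data["intercept"] = 1.0
train_cols = data.columns[1:]
logit = sm.Logit(data["admit"], data[train_cols])
result = logit.fit()
result.summary2()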

Results:

                         Results: Logit
=================================================================
Model:              Logit            Pseudo R-squared: 0.083     
Dependent Variable: admit            AIC:              470.5175  
Date:               2014-12-19 01:11 BIC:              494.4663  
No. Observations:   400              Log-Likelihood:   -229.26   
Df Model:           5                LL-Null:          -249.99   
Df Residuals:       394              LLR p-value:      7.5782e-08
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     6.0000                                       
------------------------------------------------------------------
               Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
------------------------------------------------------------------
gre            0.0023    0.0011   2.0699  0.0385   0.0001   0.0044
gpa            0.8040    0.3318   2.4231  0.0154   0.1537   1.4544
prestige_2    -0.6754    0.3165  -2.1342  0.0328  -1.2958  -0.0551
prestige_3    -1.3402    0.3453  -3.8812  0.0001  -2.0170  -0.6634
prestige_4    -1.5515    0.4178  -3.7131  0.0002  -2.3704  -0.7325
intercept     -3.9900    1.1400  -3.5001  0.0005  -6.2242  -1.7557
=================================================================

R version:

data = read.csv("http://www.ats.ucla.edu/stat/data/binary.csv", head=T)
require(reshape2)
data1 = dcast(data, admit + gre + gpa ~ rank)
require(dplyr)
names(data1)[4:7] = paste("rank", 1:4, sep="")
data1 = data1[, -4]
summary(glm(admit ~ gre + gpa + rank2 + rank3 + rank4, family=binomial, data=data1))

Results:

Call:
glm(formula = admit ~ gre + gpa + rank2 + rank3 + rank4, family = binomial,
    data = data1)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5133  -0.8661  -0.6573   1.1808   2.0629

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.184029   1.162421  -3.599 0.000319 ***
gre          0.002358   0.001112   2.121 0.033954 *
gpa          0.770591   0.343908   2.241 0.025046 *
rank2       -0.369711   0.310342  -1.191 0.233535
rank3       -1.015012   0.335147  -3.029 0.002457 **
rank4       -1.249251   0.414416  -3.014 0.002574 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 466.13  on 377  degrees of freedom
Residual deviance: 434.12  on 372  degrees of freedom
AIC: 446.12

Number of Fisher Scoring iterations: 4

The results are quite different; for example, the p-values for rank_2 are 0.03 and 0.2 respectively. What causes this difference? Note that I have created dummy variables for both versions, and a constant column for the Python version, which R takes care of automatically.

Also, it seems Python is 2x faster:

##################################################
# python timing
def test():
    for i in range(5000):
        logit = sm.Logit(data["admit"], data[train_cols])
        result = logit.fit(disp=0)
import time
start = time.time()
test()
print(time.time() - start)
10.099738836288452
##################################################
# R timing
> f = function() for(i in 1:5000) {mod = glm(admit ~ gre + gpa + rank2 + rank3 + rank4, family=binomial, data=data1)}
> system.time(f())
   user  system elapsed
 17.505   0.021  17.526

Recommended answer

I'm not sure what your data manipulations are intended to do, but they seem to be losing information in the R run. If I keep all the rank information in and fit on the original data object, I get the output below, and the results look very similar to the statsmodels ones wherever the two outputs overlap. (Likelihoods are only estimated up to an arbitrary constant, so you can only compare differences in log-likelihood. Even with that caveat, the deviance is supposed to be twice the negative log-likelihood, so those results are also comparable.)

> summary(glm(admit ~ gre + gpa +as.factor( rank), family=binomial,
       data=data))  # notice that I'm using your original data-object

Call:
glm(formula = admit ~ gre + gpa + as.factor(rank), family = binomial, 
    data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6268  -0.8662  -0.6388   1.1490   2.0790  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -3.989979   1.139951  -3.500 0.000465 ***
gre               0.002264   0.001094   2.070 0.038465 *  
gpa               0.804038   0.331819   2.423 0.015388 *  
as.factor(rank)2 -0.675443   0.316490  -2.134 0.032829 *  
as.factor(rank)3 -1.340204   0.345306  -3.881 0.000104 ***
as.factor(rank)4 -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 458.52  on 394  degrees of freedom
AIC: 470.52

Number of Fisher Scoring iterations: 4
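
As a quick sanity check on the comparison above, the figures reported by statsmodels can be lined up against the R output with a little arithmetic. The sketch below only reuses numbers copied from the outputs shown in this thread; it does not refit anything. It assumes ungrouped 0/1 responses, where the deviance reduces to minus twice the log-likelihood, and that the null model has a single parameter, so its degrees of freedom are the row count minus one. The variable names are illustrative.

```python
# Values copied from the outputs in this thread (nothing is recomputed from data).
ll_full = -229.26   # statsmodels "Log-Likelihood"
ll_null = -249.99   # statsmodels "LL-Null"
n_params = 6        # five slopes plus the intercept

# Assumption: ungrouped binary data, so deviance = -2 * log-likelihood.
residual_deviance = -2 * ll_full          # R (answer) reports 458.52
null_deviance = -2 * ll_null              # R (answer) reports 499.98
aic = 2 * n_params + residual_deviance    # both outputs report about 470.52
pseudo_r2 = 1 - ll_full / ll_null         # McFadden's; statsmodels reports 0.083

# Rows used in each R fit = null-deviance degrees of freedom + 1.
n_obs_question_fit = 377 + 1   # fit on the reshaped data1
n_obs_answer_fit = 399 + 1     # fit on the original data

print(round(residual_deviance, 2), round(null_deviance, 2))
print(round(aic, 2), round(pseudo_r2, 3))
print(n_obs_question_fit, n_obs_answer_fit)
```

Under those assumptions the statsmodels figures agree with the R fit on the original data, while the degrees of freedom suggest the fit on the reshaped data1 saw 378 rows rather than 400, which is consistent with information being lost in the reshaping step.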
