Difference in GLM results between iPython and R


Question

I'm trying to get to grips with performing regression analyses in R. Below is some random dummy data that I generated in R and used to run a logistic glm. I saved the data to a test file, read it into Python with ipython (the ipython notebook is awesome btw, I've only just started using it!), and then tried to run the same analysis in Python. The results are very similar, but they are different. I would have expected them to be the same. Have I done something wrong, is there a parameter I am missing, or is the difference due to some underlying calculation?

Any help appreciated!

EDIT: I don't know if this warrants closing, but I re-ran things with Ben's edited (and cleaner) code and the results between python and R are now the same. I didn't change the python code at all, and my previous R code and Ben's code, while different, should (as far as I can see) be doing the same thing. Regardless, the point of the question is now moot. Nonetheless, thanks for taking a look.

Generate random data and run glm:

set.seed(666)
dat <- data.frame(a=rnorm(500), b=runif(500), 
                  c=as.factor(sample(1:5, 500, replace=TRUE)))
library(plyr)
dat <- mutate(dat,
              y0=((jitter(a)^2+(-log10(b)))/(as.numeric(c)/10))+rnorm(500),
              y=(y0>=mean(y0)))                  

fit1 <- glm(y~a+b+c,data=dat,family=binomial('logit'))
summary(fit1)

Call:
glm(formula = y ~ a + b + c, family = binomial("logit"), data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5369  -0.8154  -0.5479   0.9314   2.3831  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.2363     0.3007   4.111 3.94e-05 ***
a            -0.2051     0.1062  -1.931   0.0535 .  
b            -1.6103     0.3834  -4.200 2.67e-05 ***
c2           -0.5114     0.3091  -1.654   0.0980 .  
c3           -1.3169     0.3147  -4.184 2.86e-05 ***
c4           -2.0017     0.3342  -5.990 2.09e-09 ***
c5           -2.5084     0.3772  -6.651 2.92e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 631.31  on 499  degrees of freedom
Residual deviance: 537.84  on 493  degrees of freedom
AIC: 551.84

Number of Fisher Scoring iterations: 4

Output the same data for use in python:

write.table(dat, file='test.txt', row.names=F, col.names=T, quote=F, sep=',' )

and now the python code:

import pandas as pd
import statsmodels.api as sm
import numpy as np

data = pd.read_csv('test.txt')
data.describe()
# dummy-code the factor c, keeping c_2..c_5 and dropping c_1 as the reference level
dummy_c = pd.get_dummies(data['c'], prefix='c')
data = data[['y','a','b']].join(dummy_c.loc[:,'c_2':])
data_depend = data['y']
data_independ = data.iloc[:,1:]
data_independ = sm.add_constant(data_independ, prepend=False)
glm_binom = sm.GLM(data_depend, data_independ, family=sm.families.Binomial())
res = glm_binom.fit()
print(res.summary())

which produces:

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  500
Model:                            GLM   Df Residuals:                      493
Model Family:                Binomial   Df Model:                            6
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -268.92
Date:                Sun, 27 Oct 2013   Deviance:                       537.84
Time:                        01:26:47   Pearson chi2:                     514.
No. Iterations:                     6                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
a             -0.2051      0.106     -1.931      0.054        -0.413     0.003
b             -1.6103      0.383     -4.200      0.000        -2.362    -0.859
c_2           -0.5114      0.309     -1.654      0.098        -1.117     0.094
c_3           -1.3169      0.315     -4.184      0.000        -1.934    -0.700
c_4           -2.0017      0.334     -5.990      0.000        -2.657    -1.347
c_5           -2.5084      0.377     -6.651      0.000        -3.248    -1.769
const          1.2363      0.301      4.111      0.000         0.647     1.826
==============================================================================


Answer

A bit of code cleanup:

set.seed(101)
dat <- data.frame(a=rnorm(500), b=runif(500), 
          c=as.factor(sample(1:5, 500, replace=TRUE)))
library(plyr)
dat <- mutate(dat,
         y0=((jitter(a)^2+(-log10(b)))/(as.numeric(c)/10))+rnorm(500),
         y=(y0>=mean(y0)))                  

fit1 <- glm(y~a+b+c,data=dat,family=binomial('logit'))
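## refit with Fisher scoring capped at 6 iterations (the count statsmodels reports)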
fit2 <- update(fit1,control=glm.control(maxit=6))
all.equal(fit1,fit2)

coef(fit1)
## (Intercept)           a           b          c2          c3          c4 
##  1.22283193 -0.07544488 -1.54732712 -0.36477556 -1.46313143 -1.95008291 
##          c5 
## -3.11914945
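The same iteration check is a one-liner on the statsmodels side. A minimal sketch, reusing glm_binom, res and np from the question's code (res6 is just an illustrative name; maxiter is the IRLS iteration cap in GLM.fit):

res6 = glm_binom.fit(maxiter=6)
# if 6 iterations were already enough to converge, the two fits should agree
print(np.allclose(res.params, res6.params))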

I agree with @Roland's comment that a reproducible example will help. The most likely difference is in contrast coding, e.g.:

fit3 <- update(fit1,contrasts=list(c=contr.sum))
coef(fit3)
## 
## (Intercept)           a           b          c1          c2          c3 
## -0.15659594 -0.07544488 -1.54732712  1.37942787  1.01465231 -0.08370356 
##          c4 
## -0.57065503 
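A rough statsmodels analogue of contr.sum goes through the formula interface. This is a hedged sketch, assuming patsy's Sum contrast is available in the formula namespace (raw and fit3_py are illustrative names):

import statsmodels.formula.api as smf

raw = pd.read_csv('test.txt')
raw['y'] = raw['y'].astype(int)  # make sure the TRUE/FALSE outcome is numeric 0/1
# C(c, Sum) requests sum-to-zero contrasts for the factor c, like R's contr.sum
fit3_py = smf.glm('y ~ a + b + C(c, Sum)', data=raw,
                  family=sm.families.Binomial()).fit()
print(fit3_py.params)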

If you use a model with only continuous predictors, do the results match better?
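For example, a minimal sketch of that check on the statsmodels side, reusing data and data_depend from the question (the matching R fit would be glm(y~a+b, data=dat, family=binomial('logit')); X_cont and res_cont are illustrative names):

# keep only the continuous predictors a and b, dropping the dummy-coded factor
X_cont = sm.add_constant(data[['a', 'b']], prepend=False)
res_cont = sm.GLM(data_depend, X_cont, family=sm.families.Binomial()).fit()
print(res_cont.params)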

Update: contrast coding can't be the whole story, because the deviance/log-likelihood differs as well as the coefficients.
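Both quantities are easy to pull directly off the fitted objects for comparison. A small sketch using res from the question's code; the R counterparts are deviance(fit1), logLik(fit1) and the null deviance line from summary(fit1):

print(res.deviance)       # R's "Residual deviance": 537.84 above
print(res.null_deviance)  # R's "Null deviance": 631.31 above
print(res.llf)            # log-likelihood: -268.92 in the summary above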
