Why can't statsmodels reproduce my R logistic regression results?


Question


I'm confused about why my logistic regression models in R and statsmodels do not agree.

If I prepare some data in R with

# From https://courses.edx.org/c4x/MITx/15.071x/asset/census.csv
library(caTools) # for sample.split
census = read.csv("census.csv")
set.seed(2000)
split = sample.split(census$over50k, SplitRatio = 0.6)
censusTrain = subset(census, split==TRUE)
censusTest = subset(census, split==FALSE)

and then run a logistic regression with

CensusLog1 = glm(over50k ~., data=censusTrain, family=binomial)

I see results like

                                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)                              -8.658e+00  1.379e+00  -6.279 3.41e-10 ***
age                                       2.548e-02  2.139e-03  11.916  < 2e-16 ***
workclass Federal-gov                     1.105e+00  2.014e-01   5.489 4.03e-08 ***
workclass Local-gov                       3.675e-01  1.821e-01   2.018 0.043641 *  
workclass Never-worked                   -1.283e+01  8.453e+02  -0.015 0.987885    
workclass Private                         6.012e-01  1.626e-01   3.698 0.000218 ***
workclass Self-emp-inc                    7.575e-01  1.950e-01   3.884 0.000103 ***
workclass Self-emp-not-inc                1.855e-01  1.774e-01   1.046 0.295646    
workclass State-gov                       4.012e-01  1.961e-01   2.046 0.040728 *  
workclass Without-pay                    -1.395e+01  6.597e+02  -0.021 0.983134   
...

but if I use the same data in Python, first exporting from R with

write.csv(censusTrain,file="traincensus.csv")
write.csv(censusTest,file="testcensus.csv")

and then importing into Python with

import pandas as pd

census = pd.read_csv("census.csv")
census_train = pd.read_csv("traincensus.csv")
census_test = pd.read_csv("testcensus.csv")

I get errors and strange results that bear no relationship to the ones I get in R.

If I simply try

import statsmodels.api as sm

census_log_1 = sm.Logit.from_formula(f, census_train).fit()

I get an error:

ValueError: operands could not be broadcast together with shapes (19187,2) (19187,) 

Even if I prepare the data with patsy using

import patsy
f = 'over50k ~ ' + ' + '.join(list(census.columns)[:-1])
y, X = patsy.dmatrices(f, census_train, return_type='dataframe')

trying

census_log_1 = sm.Logit(y, X).fit()

results in the same error. The only way I can avoid errors is to use GLM

census_log_1 = sm.GLM(y, X, family=sm.families.Binomial()).fit()

but this produces results that are entirely different from those produced by (what I thought was) the equivalent R API:

                                                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------------------------
Intercept                                       10.6766      5.985      1.784      0.074        -1.055    22.408
age                                             -0.0255      0.002    -11.916      0.000        -0.030    -0.021
workclass[T. Federal-gov]                       -0.9775      4.498     -0.217      0.828        -9.794     7.839
workclass[T. Local-gov]                         -0.2395      4.498     -0.053      0.958        -9.055     8.576
workclass[T. Never-worked]                       8.8346    114.394      0.077      0.938      -215.374   233.043
workclass[T. Private]                           -0.4732      4.497     -0.105      0.916        -9.288     8.341
workclass[T. Self-emp-inc]                      -0.6296      4.498     -0.140      0.889        -9.446     8.187
workclass[T. Self-emp-not-inc]                  -0.0576      4.498     -0.013      0.990        -8.873     8.758
workclass[T. State-gov]                         -0.2733      4.498     -0.061      0.952        -9.090     8.544
workclass[T. Without-pay]                       10.0745     85.048      0.118      0.906      -156.616   176.765
...

Why is logistic regression in Python producing errors and different results from those produced by R? Are these APIs not in fact equivalent (I've had them work before to produce identical results)? Is there some additional processing of the datasets required to make them usable by statsmodels?

Solution

The error is due to the fact that patsy expands the LHS variable into a full Treatment contrast, i.e. one indicator column per outcome level. Logit does not handle this, as indicated in its docstring, but as you see, GLM with a binomial family does.
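The resulting shape clash can be reproduced with plain NumPy (a toy sketch of the mismatch, not the actual statsmodels internals): a two-column endog of shape `(n, 2)` cannot broadcast against the `(n,)` arrays built during the fit, while a single indicator column can.

```python
import numpy as np

n = 5
y_expanded = np.zeros((n, 2))  # stand-in for patsy's Treatment-expanded LHS
weights = np.ones(n)           # stand-in for an internal (n,) array

# Broadcasting (n, 2) against (n,) fails, mirroring the ValueError above.
try:
    _ = y_expanded * weights
except ValueError as e:
    print(e)

# Collapsing the endog to one indicator column resolves the shape mismatch:
y_binary = y_expanded[:, 1]
print((y_binary * weights).shape)  # (5,)
```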

I can't speak to the difference in the results without the full output. In all likelihood it's a different default handling of categorical variables, or you're using different variables; not all of them are listed in your output.

You can use Logit by doing the following pre-processing step.

census = census.replace(to_replace={'over50k' : {' <=50K' : 0, ' >50K' : 1}})
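On a toy frame (hypothetical values, standing in for the census data), the recode looks like this; note the leading space in the raw labels (' <=50K', ' >50K'), which comes straight from the CSV:

```python
import pandas as pd

# Toy stand-in for the census data, with the raw labels as read from the CSV.
df = pd.DataFrame({'over50k': [' <=50K', ' >50K', ' >50K'],
                   'age': [25, 45, 30]})

# Map the string outcome to 0/1 so Logit sees a numeric 1-D endog.
df = df.replace(to_replace={'over50k': {' <=50K': 0, ' >50K': 1}})
print(df['over50k'].tolist())  # [0, 1, 1]
```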

Note also that the default solver for logit doesn't seem to work all that well for this problem. It runs into a singular matrix problem. Indeed, the condition number for this problem is huge, and what you get in R might not be a fully converged model. You might try reducing your number of dummy variables.

[73]: np.linalg.cond(mod.exog)
[73]: 4.5139498536894682e+17
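The effect of near-collinear dummy columns on the condition number can be illustrated on synthetic data (a sketch only; the census design above hits ~4.5e17 for the same underlying reason):

```python
import numpy as np

# A well-behaved design: intercept plus one continuous regressor.
X_ok = np.column_stack([np.ones(100), np.linspace(0.0, 1.0, 100)])

# Add a dummy with a single 1 plus its complement; together they
# reproduce the intercept exactly, so the design is rank-deficient.
rare = np.zeros(100)
rare[0] = 1.0
X_bad = np.column_stack([X_ok, rare, 1.0 - rare])

print(np.linalg.cond(X_ok))   # modest
print(np.linalg.cond(X_bad))  # astronomically larger
```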

I had to use the following to get a solution

mod = sm.formula.logit(f, data=census)
res = mod.fit(method='bfgs', maxiter=1000)    

Some of your cells end up being very small. This is compounded by the other sparse dummy variables.

[81]: pd.Categorical(census.occupation).describe()
[81]: 
                    counts     freqs
levels                              
?                    1816  0.056789
Adm-clerical         3721  0.116361
Armed-Forces            9  0.000281
Craft-repair         4030  0.126024
Exec-managerial      3992  0.124836
Farming-fishing       989  0.030928
Handlers-cleaners    1350  0.042217
Machine-op-inspct    1966  0.061480
Other-service        3212  0.100444
Priv-house-serv       143  0.004472
Prof-specialty       4038  0.126274
Protective-serv       644  0.020139
Sales                3584  0.112077
Tech-support          912  0.028520
Transport-moving     1572  0.049159
