How does statsmodels encode endog variables entered as strings?

Question

I'm new to using statsmodels for statistical analyses. I get the answers I expect most of the time, but there are some things I don't quite understand about the way that statsmodels defines endog (dependent) variables for logistic regression when they are entered as strings.

An example Pandas dataframe to illustrate the issue can be defined as shown below. The yN, yA and yA2 columns represent different ways to define an endog variable: yN is a binary variable coded 0, 1; yA is a binary variable coded 'y', 'n'; and yA2 is a variable coded 'n', 'y' and 'w':

import pandas as pd

df = pd.DataFrame({'yN':[0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1],
                   'yA':['y','y','y','y','y','y','y','n','n','n','n','n','n','n','n','n','n','n','n','n',],
                   'yA2':['y','y','y','w','y','w','y','n','n','n','n','n','n','n','n','n','n','n','n','n',],
                   'xA':['a','a','b','b','b','c','c','c','c','c','a','a','a','a','b','b','b','b','c','c']})

The dataframe looks like this:

   xA yA yA2  yN
0   a  y   y   0
1   a  y   y   0
2   b  y   y   0
3   b  y   w   0
4   b  y   y   0
5   c  y   w   0
6   c  y   y   0
7   c  n   n   1
8   c  n   n   1
9   c  n   n   1
10  a  n   n   1
11  a  n   n   1
12  a  n   n   1
13  a  n   n   1
14  b  n   n   1
15  b  n   n   1
16  b  n   n   1
17  b  n   n   1
18  c  n   n   1
19  c  n   n   1

I can run a 'standard' logistic regression using a 0/1 encoded endog variable and a categorical exog variable (xA) as follows:

import statsmodels.formula.api as smf
import statsmodels.api as sm

phjLogisticRegressionResults = smf.glm(formula='yN ~ C(xA)',
                                       data=df,
                                       family = sm.families.Binomial(link = sm.genmod.families.links.logit)).fit()

print('\nResults of logistic regression model')
print(phjLogisticRegressionResults.summary())

This produces the following results, which are exactly as I'd expect:

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                     yN   No. Observations:                   20
Model:                            GLM   Df Residuals:                       17
Model Family:                Binomial   Df Model:                            2
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -12.787
Date:                Thu, 18 Jan 2018   Deviance:                       25.575
Time:                        02:19:45   Pearson chi2:                     20.0
No. Iterations:                     4                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6931      0.866      0.800      0.423      -1.004       2.391
C(xA)[T.b]    -0.4055      1.155     -0.351      0.725      -2.669       1.858
C(xA)[T.c]     0.2231      1.204      0.185      0.853      -2.137       2.583
==============================================================================

However, if I run the same model using a binary endog variable encoded as 'y' and 'n' (but exactly opposite to the intuitive 0/1 coding in the previous example), or if I use a variable where some of the 'y' codes have been replaced by 'w' (i.e. there are now 3 outcomes), it still produces the same results, as follows:

phjLogisticRegressionResults = smf.glm(formula='yA ~ C(xA)',
                                       data=df,
                                       family = sm.families.Binomial(link = sm.genmod.families.links.logit)).fit()


                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:     ['yA[n]', 'yA[y]']   No. Observations:                   20
Model:                            GLM   Df Residuals:                       17
Model Family:                Binomial   Df Model:                            2
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -12.787
Date:                Thu, 18 Jan 2018   Deviance:                       25.575
Time:                        02:29:06   Pearson chi2:                     20.0
No. Iterations:                     4                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6931      0.866      0.800      0.423      -1.004       2.391
C(xA)[T.b]    -0.4055      1.155     -0.351      0.725      -2.669       1.858
C(xA)[T.c]     0.2231      1.204      0.185      0.853      -2.137       2.583
==============================================================================

...and...

phjLogisticRegressionResults = smf.glm(formula='yA2 ~ C(xA)',
                                       data=df,
                                       family = sm.families.Binomial(link = sm.genmod.families.links.logit)).fit()


                       Generalized Linear Model Regression Results                        
==========================================================================================
Dep. Variable:     ['yA2[n]', 'yA2[w]', 'yA2[y]']   No. Observations:                   20
Model:                                        GLM   Df Residuals:                       17
Model Family:                            Binomial   Df Model:                            2
Link Function:                              logit   Scale:                             1.0
Method:                                      IRLS   Log-Likelihood:                -12.787
Date:                            Thu, 18 Jan 2018   Deviance:                       25.575
Time:                                    02:29:06   Pearson chi2:                     20.0
No. Iterations:                                 4                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6931      0.866      0.800      0.423      -1.004       2.391
C(xA)[T.b]    -0.4055      1.155     -0.351      0.725      -2.669       1.858
C(xA)[T.c]     0.2231      1.204      0.185      0.853      -2.137       2.583
==============================================================================

The Dep. Variable cell in the output table recognises that there are differences, but the results are the same. What rule is statsmodels using to code the endog variable when it is entered as a string variable?

Recommended answer

Warning: this behavior is not by design and came about through the interaction of patsy and statsmodels.

First, patsy does all the conversion of the string formula and data to create the corresponding design matrix, and possibly the conversion for the response variable. When the response variable, endog or y, is a string, patsy treats it as categorical, uses the default encoding for categorical variables and builds the corresponding array of dummy variables. Also, AFAIK, patsy sorts the levels alphanumerically, which determines the order of the columns.
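The encoding step described above can be inspected directly. The following is a minimal sketch, assuming patsy is installed and using only the yA and xA columns from the question; it calls patsy's `dmatrices` on the same formula to show the dummy columns built for a string response.

```python
import pandas as pd
from patsy import dmatrices

# Same yA and xA columns as in the question.
df = pd.DataFrame({
    'yA': ['y'] * 7 + ['n'] * 13,
    'xA': list('aabbbcccccaaaabbbbcc'),
})

# patsy parses the formula and builds both sides of the model.
y, X = dmatrices('yA ~ C(xA)', data=df, return_type='dataframe')

# The string response becomes one dummy column per level,
# with the levels in sorted order, so 'n' comes before 'y'.
print(list(y.columns))
print(y.head(8))
```

The column names match the Dep. Variable cell in the question's second summary table, and the first column is the indicator for 'n'.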

The main part of the model, either GLM or Logit/Probit, just takes the array provided by patsy and interprets it in the model-appropriate way if that's possible, and does so without much specific input checking.

In the example, patsy creates the dummy variable array with either two or three columns. statsmodels interprets it as "success", "failure" counts, so the lowest category in alphanumerical order defines "success". The row sum corresponds to the number of trials in the observation, which is 1 in this case. That it works for three columns must be due to a lack of input checking, and it implies a first-against-the-rest binary response. (Which, I guess, is a consequence of the implementation details, and is not by design.)

A related problem exists in the discrete model Logit: https://github.com/statsmodels/statsmodels/issues/2733 There is no clear solution for now that would not require a lot of second-guessing the user's intention.

So for now it's better to use numerical values for binary models, especially because what defines "success", and hence the signs of the parameters, depends on the alphanumerical ordering of the categorical level names. For example, try again after renaming your "success" level to "z" instead of "n".
