Multinomial logistic multilevel models in R

Question

Problem: I need to estimate a set of multinomial logistic multilevel models and can't find an appropriate R package. What is the best R package to estimate such models? Stata 13 recently added this feature to its multilevel mixed-effects models, so the technology to estimate such models seems to be available.

Details: A number of research questions require the estimation of multinomial logistic regression models in which the outcome variable is categorical. For example, biologists might be interested in investigating which types of trees (e.g., pine, maple, oak) are most affected by acid rain. Market researchers might be interested in whether there is a relationship between the age of customers and the frequency of shopping at Target, Safeway, or Walmart. These cases have in common that the outcome variable is categorical (unordered), and multinomial logistic regression is the preferred method of estimation. In my case, I am investigating differences in types of human migration, with the outcome variable (mig) coded 0=not migrated, 1=internal migration, 2=international migration. Here is a simplified version of my data set:

migDat <- data.frame(
  hhID      = 1:21,
  mig       = rep(0:2, times = 7),
  age       = ceiling(runif(21, 15, 90)),  # random draw, so ages differ between runs
  stateID   = rep(letters[1:3], each = 7),
  pollution = rep(c("high", "low", "moderate"), each = 7),
  stringsAsFactors = FALSE
)

   hhID mig age stateID pollution
1     1   0  47       a      high
2     2   1  53       a      high
3     3   2  17       a      high
4     4   0  73       a      high
5     5   1  24       a      high
6     6   2  80       a      high
7     7   0  18       a      high
8     8   1  33       b       low
9     9   2  90       b       low
10   10   0  49       b       low
11   11   1  42       b       low
12   12   2  44       b       low
13   13   0  82       b       low
14   14   1  70       b       low
15   15   2  71       c  moderate
16   16   0  18       c  moderate
17   17   1  18       c  moderate
18   18   2  39       c  moderate
19   19   0  35       c  moderate
20   20   1  74       c  moderate
21   21   2  86       c  moderate

My goal is to estimate the impact of age (independent variable) on the odds of (1) migrating internally vs. not migrating, (2) migrating internationally vs. not migrating, (3) migrating internally vs. migrating internationally. An additional complication is that my data operate at different aggregation levels (e.g., pollution operates at the state-level) and I am also interested in predicting the impact of air pollution (pollution) on the odds of embarking on a particular type of movement.

Clunky solutions: One could estimate a set of separate logistic regression models by reducing the data set for each model to only two migration types (e.g., Model 1: only cases coded mig=0 and mig=1; Model 2: only cases coded mig=0 and mig=2; Model 3: only cases coded mig=1 and mig=2). Such a simple multilevel logistic regression model could be estimated with lme4, but this approach is less than ideal because it does not appropriately account for the impact of the omitted cases. A second solution would be to run multinomial logistic multilevel models in MLwiN through R using the R2MLwiN package. But since MLwiN is not open source and the generated objects are difficult to use, I would prefer to avoid this option. Based on a comprehensive internet search, there seems to be some demand for such models, but I am not aware of a good R package. So it would be great if experts who have run such models could provide a recommendation and, if there is more than one package, indicate some advantages and disadvantages. I am sure that such information would be a very helpful resource for many R users. Thanks!

Best, Raphael

Recommended answer

There are generally two ways of fitting a multinomial model of a categorical variable with J groups: (1) simultaneously estimating J-1 contrasts; (2) estimating a separate logit model for each contrast.

Do these two methods produce the same results? No, but the results are often similar.

Which method is better? Simultaneous fitting is more precise (see below for an explanation why).

Why would someone use separate logit models then? (1) The lme4 package has no routine for simultaneously fitting multinomial models, and I am not aware of another multilevel R package that could do this. So separate logit models are presently the only practical solution if someone wants to estimate multilevel multinomial models in R. (2) As some influential statisticians have argued (Begg and Gray, 1984; Allison, 1984, p. 46-47), separate logit models are much more flexible because they permit the independent specification of the model equation for each contrast.
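As a rough illustration of the separate-logit route with lme4, the sketch below fits one random-intercept logistic model per contrast. Everything in it is an assumption made for illustration: the data are simulated (30 states with 40 households each, not the asker's data), and the names sim, age_dec and fit_pair are hypothetical. Only glmer() and fixef() come from lme4.

```r
library(lme4)  # assumed to be installed; provides glmer() and fixef()

set.seed(1)
n_state <- 30   # hypothetical number of states
n_hh    <- 40   # hypothetical number of households per state

# Simulated stand-in for the real data set
sim <- data.frame(
  stateID = factor(rep(seq_len(n_state), each = n_hh)),
  age     = round(runif(n_state * n_hh, 15, 90))
)
sim$age_dec <- (sim$age - 50) / 10  # centred age in decades, helps the optimiser

# State-level predictor: one pollution value per state
poll_state    <- sample(c("high", "low", "moderate"), n_state, replace = TRUE)
sim$pollution <- factor(poll_state[as.integer(sim$stateID)])

# Simulate a three-category outcome with a state-level random intercept
u    <- rnorm(n_state, sd = 0.5)[as.integer(sim$stateID)]
eta1 <- -0.5 - 0.2 * sim$age_dec + u   # internal vs. no migration
eta2 <- -1.0 - 0.3 * sim$age_dec + u   # international vs. no migration
p    <- cbind(1, exp(eta1), exp(eta2))
p    <- p / rowSums(p)
sim$mig <- apply(p, 1, function(pr) sample(0:2, 1, prob = pr))

# Fit one binary multilevel logit on the subset containing only two outcome types
fit_pair <- function(data, ref, alt) {
  sub   <- data[data$mig %in% c(ref, alt), ]
  sub$y <- as.integer(sub$mig == alt)
  glmer(y ~ age_dec + pollution + (1 | stateID), data = sub, family = binomial)
}

fits <- list(
  internal_vs_none          = fit_pair(sim, ref = 0, alt = 1),
  international_vs_none     = fit_pair(sim, ref = 0, alt = 2),
  internal_vs_international = fit_pair(sim, ref = 2, alt = 1)
)

# Fixed-effect estimates (log-odds scale) for each contrast
lapply(fits, fixef)
```

Each model sees only the households in its two categories, which is exactly the limitation discussed in this answer: the three fits share no information, and their random-intercept variances are estimated separately.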

Is it legitimate to use separate logit models? Yes, with some disclaimers. This method is called the "Begg and Gray Approximation". Begg and Gray (1984, p. 16) showed that this "individualized method is highly efficient". However, there is some efficiency loss and the Begg and Gray Approximation produces larger standard errors (Agresti 2002, p. 274). As such, it is more difficult to obtain significant results with this method and the results can be considered conservative. This efficiency loss is smallest when the reference category is large (Begg and Gray, 1984; Agresti 2002). R packages that employ the Begg and Gray Approximation (not multilevel) include mlogitBMA (Sevcikova and Raftery, 2012).
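To get a feel for how close the two approaches are in the single-level case, the sketch below compares a simultaneous fit from nnet::multinom() with separate binary glm() fits on simulated data. The data, the true coefficient values, and the names d, fit_sim, fit_BA and fit_CA are all assumptions for illustration; this is not the mlogitBMA implementation.

```r
library(nnet)  # a recommended package shipped with R; provides multinom()

set.seed(42)
n     <- 5000
age_c <- (runif(n, 15, 90) - 50) / 10   # centred age in decades

# Hypothetical true log-odds relative to the reference category A
eta_B <- -0.3 + 0.4 * age_c
eta_C <- -0.8 - 0.5 * age_c
p <- cbind(A = 1, B = exp(eta_B), C = exp(eta_C))
p <- p / rowSums(p)

mig <- factor(
  apply(p, 1, function(pr) sample(c("A", "B", "C"), 1, prob = pr)),
  levels = c("A", "B", "C")
)
d <- data.frame(mig = mig, age_c = age_c)

# (1) Simultaneous estimation of both contrasts against A, using all cases
fit_sim <- multinom(mig ~ age_c, data = d, trace = FALSE)

# (2) Separate logits: each model drops the cases of the third category
fit_BA <- glm(as.integer(mig == "B") ~ age_c, family = binomial,
              data = subset(d, mig != "C"))
fit_CA <- glm(as.integer(mig == "C") ~ age_c, family = binomial,
              data = subset(d, mig != "B"))

# Coefficients side by side
rbind(simultaneous_B = coef(fit_sim)["B", ], separate_B = coef(fit_BA))
rbind(simultaneous_C = coef(fit_sim)["C", ], separate_C = coef(fit_CA))

# Standard errors side by side
rbind(simultaneous_B = summary(fit_sim)$standard.errors["B", ],
      separate_B     = sqrt(diag(vcov(fit_BA))))
```

With a sample this large the point estimates from the two approaches should be close; the standard-error comparison is where the efficiency loss described above would show up, if at all.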

Why is a series of individual logit models imprecise? In my initial example we have a variable (migration) that can have three values A (no migration), B (internal migration), C (international migration). With only one predictor variable x (age), multinomial models are parameterized as a series of binomial contrasts as follows (Long and Cheng, 2004 p. 277):

Eq. 1:  Ln(Pr(B|x)/Pr(A|x)) = b0,B|A + b1,B|A (x) 
Eq. 2:  Ln(Pr(C|x)/Pr(A|x)) = b0,C|A + b1,C|A (x)
Eq. 3:  Ln(Pr(B|x)/Pr(C|x)) = b0,B|C + b1,B|C (x)

For these contrasts the following equations must hold:

Eq. 4: Ln(Pr(B|x)/Pr(A|x)) - Ln(Pr(C|x)/Pr(A|x)) = Ln(Pr(B|x)/Pr(C|x))
Eq. 5: b0,B|A - b0,C|A = b0,B|C
Eq. 6: b1,B|A - b1,C|A = b1,B|C

The problem is that these equations (Eq. 4-6) will in practice not hold exactly, because the coefficients are estimated from slightly different samples: only cases from the two contrasting groups are used, and cases from the third group are omitted. Programs that simultaneously estimate the multinomial contrasts make sure that Eq. 4-6 hold (Long and Cheng, 2004 p. 277). I don't know exactly how this "simultaneous" model solving works; maybe someone can provide an explanation? Software that performs simultaneous fitting of multilevel multinomial models includes MLwiN (Steele 2013, p. 4) and Stata (xlmlogit command, Pope, 2014).
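On the question of how simultaneous estimation works, the following is a sketch of the standard single-level multinomial logit setup with A as the reference category. The indicator y_{ij} (equal to 1 if case i falls in category j) is notation introduced here; the coefficients follow Eq. 1 and Eq. 2.

```latex
\Pr(A \mid x) = \frac{1}{D(x)}, \qquad
\Pr(B \mid x) = \frac{\exp(b_{0,B|A} + b_{1,B|A}\,x)}{D(x)}, \qquad
\Pr(C \mid x) = \frac{\exp(b_{0,C|A} + b_{1,C|A}\,x)}{D(x)},

D(x) = 1 + \exp(b_{0,B|A} + b_{1,B|A}\,x) + \exp(b_{0,C|A} + b_{1,C|A}\,x).

% One log-likelihood over all cases and all three categories
\ell(b) = \sum_{i=1}^{n} \sum_{j \in \{A,B,C\}} y_{ij} \,\ln \Pr(j \mid x_i).

% The third contrast is implied rather than estimated
\ln\frac{\Pr(B \mid x)}{\Pr(C \mid x)}
  = (b_{0,B|A} - b_{0,C|A}) + (b_{1,B|A} - b_{1,C|A})\,x.
```

Only the J-1 = 2 coefficient sets against the reference are free parameters, and every case contributes to the single likelihood. The B versus C contrast is a difference of the estimated coefficients, so the constraints among the contrasts hold by construction rather than approximately. In the multilevel version, random effects are added to the linear predictors and integrated out of the same joint likelihood.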

References:

Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: John Wiley & Sons.

Allison, P. D. (1984). Event history analysis. Thousand Oaks, CA: Sage Publications.

Begg, C. B., & Gray, R. (1984). Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71(1), 11-18.

Long, J. S., & Cheng, S. (2004). Regression models for categorical outcomes. In M. Hardy & A. Bryman (Eds.), Handbook of data analysis (pp. 258-285). London: SAGE Publications, Ltd.

Pope, R. (2014). In the spotlight: Meet Stata's new xlmlogit command. Stata News, 29(2), 2-3.

Sevcikova, H., & Raftery, A. (2012). Estimation of multinomial logit model using the Begg & Gray approximation.

Steele, F. (2013). Module 10: Single-level and multilevel models for nominal responses concepts. Bristol, U.K.: Centre for Multilevel Modelling.
