lme4::lmer reports "fixed-effect model matrix is rank deficient": do I need a fix, and how?


Problem description

I am trying to run a mixed-effects model that predicts F2_difference with the rest of the columns as predictors, but I get an error message that says

fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients.

Based on the linked question "Fixed-effects model is rank deficient", I think I should use findLinearCombos from the R package caret. However, when I try findLinearCombos(data.df), it gives me the error message

Error in qr.default(object) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In qr.default(object) : NAs introduced by coercion

My data does not have any NAs. What could be causing this? (Sorry if the answer is very obvious; I am new to R.)

All of my variables are factors except the numerical value that I am trying to predict. Here is a small sample of my data.

sex <- c("f", "m", "f", "m")
nasal <- c("TRUE", "TRUE", "FALSE", "FALSE")
vowelLabel <- c("a", "e", "i", "o")
speaker <- c("Jim", "John", "Ben", "Sally")
word_1 <- c("going", "back", "bag", "back")
type <- c("coronal", "coronal", "labial", "velar")
F2_difference <- c(345.6, -765.8, 800, 900.5)
data.df <- data.frame(sex, nasal, vowelLabel, speaker,
                      word_1, type, F2_difference,
                      stringsAsFactors = TRUE)

Here is some more code, if it helps.

formula <- F2_difference ~ sex + nasal + type + vowelLabel + 
           type * vowelLabel + nasal * type +
           (1|speaker) + (1|word_1)

lmer(formula, REML = FALSE, data = data.df)

Editor's note

The OP did not provide enough test data for the reader to actually run the model in lmer. But this is not too big an issue. This is still a very good post!

Recommended answer

You are slightly over-concerned with the warning message:

fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients.

It is a warning, not an error. There is neither misuse of lmer nor ill-specification of the model formula, so you will still obtain an estimated model. But to answer your question, I will try to explain it.

During execution of lmer, your model formula is split into a fixed-effect formula and a random-effect formula, and a model matrix is constructed for each. The fixed-effect one is built by the standard model matrix constructor model.matrix; construction of the random-effect one is more complicated but not related to your question, so I will skip it.

For your model, you can check what the fixed-effect model matrix looks like with:

fix.formula <- F2_difference ~ sex + nasal + type + vowelLabel + 
               type * vowelLabel + nasal * type

X <- model.matrix (fix.formula, data.df)

All your variables are factors, so X will be binary. Although model.matrix applies contrasts to each factor and to their interactions, X may still not end up with full column rank, because a column may be a linear combination of some others (either exactly or numerically close). In your case, some levels of one factor may be nested in some levels of another.
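A minimal sketch of the nesting case, using a hypothetical toy data frame rather than the OP's data (the names toy, word and type are illustrative assumptions). Each word occurs with exactly one type, so the type column can be rebuilt from the word columns and the model matrix loses one rank:

```r
# Hypothetical toy data: every word occurs with exactly one type,
# so `word` is nested in `type`.
toy <- data.frame(
  type = factor(rep(c("coronal", "labial"), each = 4)),
  word = factor(rep(c("w1", "w2", "w3", "w4"), each = 2))
)

X_toy <- model.matrix(~ type + word, data = toy)
colnames(X_toy)

# Compare the number of columns with the column rank.
rank_toy <- qr(X_toy)$rank
c(columns = ncol(X_toy), rank = rank_toy)

# The exact dependency: typelabial is the sum of wordw3 and wordw4.
nested <- all(X_toy[, "typelabial"] == X_toy[, "wordw3"] + X_toy[, "wordw4"])
nested
```

With eight rows and five columns, the shortfall in rank here comes from the nesting and not from having too few observations.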

Rank deficiency can arise in many different ways. The other answer shares a CrossValidated answer offering substantial discussion, on which I will make some comments.

  • For case 1, people can actually do feature selection via, say, LASSO.
  • Cases 2 and 3 are related to the data collection process. A good experimental design is the best way to prevent rank deficiency, but for many people who build models, the data are already there and no improvement (like getting more data) is possible. However, I would like to stress that even with a dataset that is not rank deficient, we can still run into this problem if we don't use it carefully. For example, cross-validation is a good method for model comparison. To do it we need to split the complete dataset into a training set and a test set, but without care we may get a rank-deficient model from the training set.
  • Case 4 is a big problem that could be completely out of our control. Perhaps a natural choice is to reduce model complexity, but an alternative is to try penalized regression.
  • Case 5 is a numerical concern leading to numerical rank deficiency, and this is a good example.
  • Cases 6 and 7 reflect the fact that numerical computations are performed in finite precision. Usually these won't be an issue if case 5 is dealt with properly.
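The cross-validation point can be sketched with a hypothetical toy data frame (full, train and the factor levels are illustrative assumptions). The full data cover every type-by-vowel cell, but a careless split sends the only observation of one cell to the test set, so the interaction column in the training model matrix is all zeros:

```r
# Hypothetical toy data: all four type-by-vowel cells are observed.
full <- data.frame(
  type  = factor(c("A", "A", "A", "B", "B", "B")),
  vowel = factor(c("a", "e", "e", "a", "a", "e"))
)
X_full <- model.matrix(~ type * vowel, data = full)

# A careless split: the only row with type "B" and vowel "e"
# ends up in the test set, so the training set never sees that cell.
train   <- full[-6, ]
X_train <- model.matrix(~ type * vowel, data = train)

rank_full  <- qr(X_full)$rank
rank_train <- qr(X_train)$rank
c(columns = ncol(X_full), rank_full = rank_full, rank_train = rank_train)
```

Stratifying the split on the factor combinations that appear in the model is one way to guard against this, though with very sparse cells it may not always be possible.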

So, sometimes we can work around the deficiency, but this is not always possible. Thus, any well-written model fitting routine, like lm, glm or mgcv::gam, will apply a QR decomposition to X so that only its full-rank subspace is used for estimation, i.e., a maximal subset of X's columns that spans a full-rank space, while the coefficients associated with the remaining columns are fixed at 0 or NA. The warning you got just implies this. There are originally ncol(X) coefficients to estimate, but due to the deficiency only ncol(X) - 7 will be estimated, with the rest being 0 or NA. Such a numerical workaround ensures that a least squares solution can be obtained in the most stable manner.
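As a rough illustration of how a pivoted QR decomposition exposes the redundant columns, here is a sketch with base R's qr on a hypothetical numeric matrix (X_num, x1, x2 and dup are made-up names). It shows the general idea only and is not a claim about the exact internals of lm or lmer:

```r
# Hypothetical design matrix in which `dup` is exactly 2 * x1.
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- c(2, 1, 4, 3, 6, 5)
X_num <- cbind("(Intercept)" = 1, x1 = x1, dup = 2 * x1, x2 = x2)

# Base R's default QR uses limited column pivoting: columns that are
# (numerically) linear combinations of earlier ones are moved to the end.
qr_X <- qr(X_num)
qr_X$rank
qr_X$pivot

# The first `rank` pivoted columns are kept; the rest are redundant.
kept    <- colnames(X_num)[qr_X$pivot[seq_len(qr_X$rank)]]
dropped <- colnames(X_num)[qr_X$pivot[-seq_len(qr_X$rank)]]
kept
dropped
```

The columns listed in dropped are the ones whose coefficients a fitting routine would leave unestimated.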

To better digest this issue, you can use lm to fit a linear model with fix.formula.

fix.fit <- lm(fix.formula, data.df, method = "qr", singular.ok = TRUE)

method = "qr" and singular.ok = TRUE are the defaults, so we don't actually need to set them. But if we specify singular.ok = FALSE, lm will stop and complain about rank deficiency.

lm(fix.formula, data.df, method = "qr", singular.ok = FALSE)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
#  singular fit encountered

You can then check the returned values in fix.fit.

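coef <- fix.fit$coef           # coefficient vector; NA marks dropped columns
p <- length(coef)              # total number of coefficients
no.NA <- sum(is.na(coef))      # number of coefficients not estimated
rank <- fix.fit$rank           # numerical rank of the model matrix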

It is guaranteed that p = ncol(X), and you should see no.NA = 7 and rank + no.NA = p.
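Since the OP's sample is too small to reproduce the warning, the same check can be tried on a hypothetical toy data set (the data values and the names toy, toy_fit and friends are made up for illustration). Here "velar" never occurs with vowel "e", so one interaction coefficient cannot be estimated:

```r
# Hypothetical toy data: the cell (type = "velar", vowel = "e") is empty.
toy <- data.frame(
  type  = factor(c("coronal", "coronal", "labial", "labial", "velar", "velar")),
  vowel = factor(c("a", "e", "a", "e", "a", "a")),
  y     = c(345.6, -765.8, 800, 900.5, 120, -50)
)

toy_fit  <- lm(y ~ type * vowel, data = toy)
toy_coef <- coef(toy_fit)

p_toy <- length(toy_coef)        # total number of coefficients
n_na  <- sum(is.na(toy_coef))    # coefficients lm could not estimate
c(p = p_toy, n_na = n_na, rank = toy_fit$rank)

# Which coefficient was dropped?
na_names <- names(toy_coef)[is.na(toy_coef)]
na_names
```

The identity rank + n_na = p holds here just as described for fix.fit.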

Exactly the same thing happens inside lmer. lm does not report the deficiency, while lmer does. This is in fact informative: all too often I see people asking why lm returns NA for some coefficients.

Update 1 (2016-05-07):

Let me see if I have this right: The short version is that one of my predictor variables is correlated with another, but I shouldn't worry about it. It is appropriate to use factors, correct? And I can still compare models with anova or by looking at the BIC?

Don't worry about the use of summary or anova. The methods are written so that the correct number of parameters (degrees of freedom) is used to produce valid summary statistics.
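A small sketch of this bookkeeping, shown with lm on hypothetical data because it needs only base R (the variables and values are made up, and the assumption is that lmer's methods account for dropped columns in an analogous way, as the answer states):

```r
# Hypothetical data in which `dup` duplicates the information in x1.
x1  <- c(1, 2, 3, 4, 5, 6)
x2  <- c(2, 1, 4, 3, 6, 5)
dup <- 2 * x1
y   <- c(1.2, 0.7, 3.1, 2.4, 5.3, 4.1)

small <- lm(y ~ x1)
big   <- lm(y ~ x1 + dup + x2)   # rank deficient: `dup` gets an NA coefficient

# Degrees of freedom are based on the rank, not on the number of columns.
big$rank                         # number of estimable coefficients
df.residual(big)                 # n - rank
attr(logLik(big), "df")          # rank + 1 (the extra 1 is the residual variance)

# Model comparison works as usual.
anova(small, big)
BIC(small, big)
```

The redundant column therefore does not inflate the parameter count used by anova or BIC.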

Update 2 (2016-11-06):

Let's also hear what the package author of lme4 has to say: "rank deficiency warning mixed model lmer". Ben Bolker mentions caret::findLinearCombos too, particularly because the OP there wants to address the deficiency issue himself.
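If you do want to locate the dependencies yourself, note that caret::findLinearCombos is documented to take a numeric matrix, so a likely cause of the coercion error in the question is that it was given the raw data frame of factors. A hedged sketch on hypothetical toy data (toy and its columns are illustrative): build the numeric model matrix first, then pass that. The caret call is guarded because the package may not be installed.

```r
# Hypothetical toy data with `word` nested in `type`.
toy <- data.frame(
  type = factor(rep(c("coronal", "labial"), each = 4)),
  word = factor(rep(c("w1", "w2", "w3", "w4"), each = 2))
)

# Turn the factors into a numeric dummy-coded matrix first.
X_toy <- model.matrix(~ type + word, data = toy)
is_numeric_matrix <- is.matrix(X_toy) && is.numeric(X_toy)
is_numeric_matrix

# Only run the caret step if the package is available.
if (requireNamespace("caret", quietly = TRUE)) {
  combos <- caret::findLinearCombos(X_toy)
  # Columns caret suggests removing to restore full column rank
  print(colnames(X_toy)[combos$remove])
}
```

Which of the mutually dependent columns is suggested for removal depends on column order, so the choice is a convenience and not a statement about which predictor matters.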

Update 3 (2018-07-27):

Rank deficiency is not a problem for valid model estimation and comparison, but it could be a hazard in prediction. I recently wrote a detailed answer with simulated examples on CrossValidated: R lm, Could anyone give me an example of the misleading case on "prediction from a rank-deficient"? So, yes, in theory we should avoid rank-deficient estimation. But in reality, there is no so-called "true model": we try to learn it from data. We can never compare an estimated model to the "truth"; the best bet is to choose the best one from the models we have built. So if the "best" model ends up rank deficient, we can be skeptical about it, but there is probably nothing we can do immediately.

