predict.glm() with three new categories in the test data (r)(error)
Question
I have a data set called data, which has 481 092 rows.
I split data into two equal halves:
- The first half (rows 1 to 240 546) is called train and was used for glm();
- the second half (rows 240 547 to 481 092) is called test and should be used to validate the model.
Then I ran the regression:
testreg <- glm(train$returnShipment ~ train$size + train$color + train$price +
train$manufacturerID + train$salutation + train$state +
train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
Now the prediction:
prediction <- predict.glm(testreg, newdata=test, type="response")
gives me an error:
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
Now I know that these levels were omitted in the regression, because it doesn't show any coefficients for them.
I have tried this: "predict.lm() with an unknown factor level in test data". But somehow it doesn't work for me, or maybe I just don't understand how to implement it. I want to predict the binary dependent variable, but of course only with the existing coefficients. The link above suggests telling R that rows with new levels should simply be treated as NA.
How should I proceed?
Edit: the approach suggested by Z.Li
I ran into a problem in the first step:
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
but mID125 is NULL! What have I done wrong?
Recommended answer
In fixed-effect modelling, which includes linear models and generalized linear models, it is impossible to estimate the effect of a factor level that was absent during fitting. glm (as well as lm) keeps a record of the factor levels that were present and used during model fitting; it can be found in testreg$xlevels.
Your model formula for model estimation is:
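As a minimal sketch of this behaviour on toy data: the data frames train_toy and test_toy, the factor g and the unseen level "c" below are hypothetical names invented for illustration, not taken from the question.

```r
## Toy data (hypothetical): a binary response and a two-level factor 'g'
set.seed(1)
train_toy <- data.frame(y = rbinom(40, 1, 0.5),
                        g = factor(rep(c("a", "b"), each = 20)))
test_toy  <- data.frame(g = factor(c("a", "b", "c")))  # "c" was never seen in training

fit <- glm(y ~ g, family = binomial(link = "logit"), data = train_toy)

## levels recorded at fitting time
fit$xlevels

## levels present in the new data but unknown to the model
new_lv <- setdiff(levels(test_toy$g), fit$xlevels$g)

## predict() refuses new levels; capture the error message instead of stopping
res <- tryCatch(predict(fit, newdata = test_toy, type = "response"),
                error = function(e) conditionMessage(e))
```

Comparing the levels of the test factor against fit$xlevels before calling predict is a cheap way to see in advance which levels will trigger the error.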
returnShipment ~ size + color + price + manufacturerID + salutation +
state + age + deliverytime
Then predict complains about the new factor levels 125, 136, 137 of manufacturerID. This means these levels are not in testreg$xlevels$manufacturerID and therefore have no associated coefficients for prediction. In this case, we have to drop this factor variable and use a prediction formula:
returnShipment ~ size + color + price + salutation +
state + age + deliverytime
However, the standard predict routine cannot take a customized prediction formula. There are commonly two solutions:
- extract the model matrix and model coefficients from testreg, and manually predict the model terms we want by matrix-vector multiplication. This is what the link given in your post suggests;
- reset the factor levels in test to any one level that appears in testreg$xlevels$manufacturerID, for example testreg$xlevels$manufacturerID[1]. That way we can still use the standard predict for prediction.
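The first option might look roughly like the following on toy data. This is only a sketch: the data frame d, the variables x, g and y, and the "^g" name pattern used to pick out the factor's dummy columns are assumptions made for illustration.

```r
## Toy data (hypothetical): numeric 'x', three-level factor 'g', binary 'y'
set.seed(2)
d <- data.frame(x = rnorm(60),
                g = factor(rep(c("a", "b", "c"), times = 20)))
d$y <- rbinom(60, 1, plogis(0.5 * d$x))
fit <- glm(y ~ x + g, family = binomial(link = "logit"), data = d)

newd <- data.frame(x = c(-1, 0, 1))    # 'g' is not needed on this manual route

beta <- coef(fit)
keep <- !grepl("^g", names(beta))      # drop the dummy coefficients of factor 'g'
X    <- model.matrix(~ x, data = newd) # design matrix: intercept and x only
eta  <- drop(X %*% beta[keep])         # linear predictor without the 'g' term
p    <- fit$family$linkinv(eta)        # back to the probability scale
```

Matching coefficient names by pattern is fragile when several variables share a prefix, so in real data the columns to drop should be selected more carefully.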
Now, let's first pick a factor level that was used for model fitting:
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
Then we assign this level to your prediction data:
replacement <- factor(rep(mID125, length = nrow(test)), levels = xlevels)
test$manufacturerID <- replacement
Now we are ready to predict:
pred <- predict(testreg, test, type = "link") ## don't use type = "response" here!!
Next, we adjust this linear predictor by subtracting the factor estimate:
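est <- coef(testreg)[paste0("manufacturerID", mID125)]  ## the variable name must be a quoted string
if (is.na(est)) est <- 0  ## under treatment contrasts the baseline level has no separate coefficient
pred <- pred - est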
Finally, if you want predictions on the original scale, apply the inverse of the link function:
testreg$family$linkinv(pred)
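The steps above can be tried end to end on toy data. In this sketch the data frames tr and te, the factor id and the new level "9" are hypothetical. It deliberately picks a non-baseline level (xlevels[2]), since under R's default treatment contrasts the first level is absorbed into the intercept and has no coefficient of its own.

```r
## Toy data (hypothetical): the test set contains a level "9" unseen in training
set.seed(3)
tr <- data.frame(x  = rnorm(90),
                 id = factor(rep(c("1", "2", "3"), each = 30)))
tr$y <- rbinom(90, 1, plogis(0.3 * tr$x))
te <- data.frame(x = c(-0.5, 0.5), id = factor(c("2", "9")))

fit <- glm(y ~ x + id, family = binomial(link = "logit"), data = tr)

xlevels <- fit$xlevels$id
lev     <- xlevels[2]                       # a level that has its own coefficient
te$id   <- factor(rep(lev, nrow(te)), levels = xlevels)

eta <- predict(fit, te, type = "link")      # linear predictor, still includes the 'id' term
eta <- eta - coef(fit)[paste0("id", lev)]   # remove the contribution of the 'id' term
p   <- fit$family$linkinv(eta)              # probabilities based on intercept and x only
```

After the adjustment, eta equals the intercept plus the x effect, which is what a prediction formula without the factor would give.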
Update:
You complained that you ran into various problems when trying the above solution. Here is why.
Your code:
testreg <- glm(train$returnShipment~ train$size + train$color +
train$price + train$manufacturerID + train$salutation +
train$state + train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
is a very bad way to specify your model formula. With train$returnShipment, etc. in the formula, the variables are always looked up in the data frame train, whatever is passed as data or newdata, so you will have trouble later when predicting with other data sets such as test.
As a simple example of this drawback, we simulate some toy data and fit a GLM:
set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))
toy <- glm(foo$y ~ foo$a, data = foo) ## bad style
> toy$formula
foo$y ~ foo$a
> toy$xlevels
$`foo$a`
[1] "a" "b" "c" "d"
Now we see that everything comes with the prefix foo$. During prediction:
newdata <- foo[1:2, ] ## take first 2 rows of "foo" as "newdata"
rm(foo) ## remove "foo" from R session
predict(toy, newdata)
we get an error:
Error in eval(expr, envir, enclos) : object 'foo' not found
The good style is to use plain variable names in the formula and let the data argument of the function determine where they are looked up:
foo <- data.frame(y = y, a = factor(a))
toy <- glm(y ~ a, data = foo)
Then the foo$ prefix disappears.
> toy$formula
y ~ a
> toy$xlevels
$a
[1] "a" "b" "c" "d"
This explains two things:
- You complained to me in the comments that testreg$xlevels$manufacturerID is NULL: the levels are stored under the name train$manufacturerID instead;
- The prediction error you posted
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
complains about train$manufacturerID rather than test$manufacturerID.
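Once the model is fitted with plain variable names, the NA approach mentioned in the question can also be sketched. The toy data frames tr and te, the factor id and the new level "7" are hypothetical; this is one possible way to do it, not necessarily what the linked post does.

```r
## Toy data (hypothetical): level "7" appears only in the test set
set.seed(4)
tr <- data.frame(x  = rnorm(60),
                 id = factor(rep(c("1", "2", "3"), each = 20)))
tr$y <- rbinom(60, 1, 0.5)
te <- data.frame(x = c(0.1, 0.2, 0.3), id = factor(c("1", "7", "3")))

fit <- glm(y ~ x + id, family = binomial(link = "logit"), data = tr)  # plain names

## with a clean formula the recorded levels are found under the plain name
fit$xlevels$id

## re-encode the test factor with the training levels; unseen levels become NA
te$id <- factor(as.character(te$id), levels = fit$xlevels$id)

## predict() passes NA through, so the row with the new level is predicted as NA
p <- predict(fit, newdata = te, type = "response")
```

Rows with unseen levels then carry NA predictions instead of stopping the whole call with an error.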