predict.glm() with three new categories in the test data (r)(error)

Question

I have a data set called data which has 481 092 rows.

I split data into two equal halves:

  1. The first half (rows 1 : 240 546) is called train and was used for the glm();
  2. the second half (rows 240 547 : 481 092) is called test and should be used to validate the model;

Then I ran the regression:

testreg <- glm(train$returnShipment ~ train$size + train$color + train$price + 
               train$manufacturerID + train$salutation + train$state +
               train$age + train$deliverytime, 
               family=binomial(link="logit"), data=train)

Now the prediction:

prediction <- predict.glm(testreg, newdata=test, type="response")

gives me an error:

Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137

Now I know that these levels were omitted in the regression because it doesn't show any coefficients for these levels.

I have tried this: predict.lm() with an unknown factor level in test data. But somehow it doesn't work for me, or maybe I just don't get how to implement it. I want to predict the binary dependent variable, but of course only with the existing coefficients. The link above suggests telling R that rows with new levels should simply be treated as NA.
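
A minimal sketch of that NA idea on simulated data. The data frames train_toy and test_toy, their columns, and the names fit, known and p are invented for illustration and are not the data from the question. Levels that never occurred in the training data are recoded to NA before predicting, so predict should return NA for those rows instead of stopping with an error.

```r
## Toy data (hypothetical): the test set contains level "d",
## which the training set lacks
set.seed(1)
train_toy <- data.frame(
  y  = rbinom(60, 1, 0.5),
  id = factor(sample(c("a", "b", "c"), 60, replace = TRUE))
)
test_toy <- data.frame(id = factor(c("a", "b", "d", "c", "d")))

fit <- glm(y ~ id, family = binomial(link = "logit"), data = train_toy)

## Levels known to the fitted model
known <- fit$xlevels$id

## Re-create the factor with only the known levels; unseen values become NA
test_toy$id <- factor(as.character(test_toy$id), levels = known)

## predict keeps rows with NA by default (na.action = na.pass),
## so those rows get an NA prediction
p <- predict(fit, newdata = test_toy, type = "response")
```

Rows 3 and 5 of test_toy carry the unseen level, so only the other three rows receive a fitted probability.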

How should I proceed?

Edit: the approach suggested by Z. Li

I ran into a problem in the first step:

xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]

but mID125 is NULL! What have I done wrong?

Recommended answer

In fixed-effect modelling, which includes linear models and generalized linear models, it is impossible to obtain an estimate for a new factor level. glm (as well as lm) keeps a record of which factor levels were present and used during model fitting; it can be found in testreg$xlevels.
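
As a hedged illustration of that record, the sketch below fits a small GLM on made-up data and compares the stored levels with those in a test set. The objects train_toy, test_toy, fit, seen and new_levels are hypothetical names chosen for this example only.

```r
## Toy data (hypothetical): level "d" appears only in the test set
train_toy <- data.frame(
  y  = c(0, 1, 0, 1,  1, 1, 0, 1,  0, 0, 1, 0),
  id = factor(rep(c("a", "b", "c"), each = 4))
)
test_toy <- data.frame(id = factor(c("a", "d", "c")))

fit <- glm(y ~ id, family = binomial(link = "logit"), data = train_toy)

## Levels recorded at fitting time
seen <- fit$xlevels$id

## Levels in the test data that the model has never seen
new_levels <- setdiff(levels(test_toy$id), seen)
```

Any level listed in new_levels has no coefficient in coef(fit), which is why predict refuses to handle it.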

Your model formula for model estimation is:

returnShipment ~ size + color + price + manufacturerID + salutation + 
                 state + age + deliverytime

Then predict complains about the new factor levels 125, 136, 137 for manufacturerID. This means these levels are not in testreg$xlevels$manufacturerID and therefore have no associated coefficient for prediction. In this case, we have to drop this factor variable and use a prediction formula:

returnShipment ~ size + color + price + salutation + 
                 state + age + deliverytime

However, the standard predict routine cannot take a customized prediction formula. There are two common solutions:

  1. extract the model matrix and model coefficients from testreg, and manually predict the model terms we want by matrix-vector multiplication. This is what the link given in your post suggests;
  2. reset the factor levels in test to any one level that appears in testreg$xlevels$manufacturerID, for example testreg$xlevels$manufacturerID[1]. That way, we can still use the standard predict for prediction.
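
The first solution can be sketched roughly as follows on simulated data. All object names here (train_toy, test_toy, fit, X, beta, eta, p) are assumptions made for the example, and the reduced formula ~ x stands in for the prediction formula without the factor.

```r
## Toy data (hypothetical): factor id has an unseen level "d" in the test set
set.seed(42)
train_toy <- data.frame(
  x  = rnorm(80),
  id = factor(sample(c("a", "b", "c"), 80, replace = TRUE))
)
train_toy$y <- rbinom(80, 1, plogis(0.5 * train_toy$x))
test_toy <- data.frame(x = c(-1, 0, 1), id = factor(c("a", "d", "d")))

fit <- glm(y ~ x + id, family = binomial(link = "logit"), data = train_toy)

## Design matrix for the reduced prediction formula (no id term)
X <- model.matrix(~ x, data = test_toy)

## Keep only the coefficients that match the columns of X
beta <- coef(fit)[colnames(X)]

## Linear predictor by matrix-vector multiplication, then the inverse link
eta <- drop(X %*% beta)
p <- fit$family$linkinv(eta)
```

With the default treatment contrasts, dropping the id columns amounts to predicting every row at the reference level of id.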

Now, let's first pick a factor level that was used in model fitting:

xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]

Then we assign this level to your prediction data:

replacement <- factor(rep(mID125, length = nrow(test)), levels = xlevels)
test$manufacturerID <- replacement

Now we are ready to predict:

pred <- predict(testreg, test, type = "link")  ## don't use type = "response" here!!

Next, we adjust this linear predictor by subtracting the estimate for the substituted factor level:

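## coefficient names are "manufacturerID" followed by the level, so quote the prefix
est <- coef(testreg)[paste0("manufacturerID", mID125)]
if (is.na(est)) est <- 0  ## the reference level has no coefficient of its own
pred <- pred - est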

Finally, if you want predictions on the original scale, apply the inverse of the link function:

testreg$family$linkinv(pred)
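
Put together, the steps above might look like this on simulated data. This is only a sketch: train_toy, test_toy, fit, lev and p are hypothetical names, and a non-reference level is chosen deliberately so that the substituted level has a coefficient to subtract.

```r
## Toy data (hypothetical): factor id has an unseen level "d" in the test set
set.seed(42)
train_toy <- data.frame(
  x  = rnorm(80),
  id = factor(sample(c("a", "b", "c"), 80, replace = TRUE))
)
train_toy$y <- rbinom(80, 1, plogis(0.5 * train_toy$x))
test_toy <- data.frame(x = c(-1, 0, 1), id = factor(c("a", "d", "d")))

fit <- glm(y ~ x + id, family = binomial(link = "logit"), data = train_toy)

## Pick a level the model knows; the second one is not the reference level
xlevels <- fit$xlevels$id
lev <- xlevels[2]

## Overwrite the factor in the prediction data with that level
test_toy$id <- factor(rep(lev, nrow(test_toy)), levels = xlevels)

## Predict on the link scale, then remove the contribution of the level
pred <- predict(fit, test_toy, type = "link")
pred <- pred - coef(fit)[paste0("id", lev)]

## Back-transform to probabilities
p <- fit$family$linkinv(pred)
```

Note that this overwrites id in every row, including rows whose level was known, so the factor is effectively left out of all predictions.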


Update:

You complained that you ran into various troubles when trying the above solution. Here is why.

Your code:

testreg <- glm(train$returnShipment~ train$size + train$color + 
               train$price + train$manufacturerID + train$salutation + 
               train$state + train$age + train$deliverytime,
               family=binomial(link="logit"), data=train)

is a very bad way to specify your model formula. Terms such as train$returnShipment are looked up directly in the data frame train, regardless of the data argument, so you will have trouble later when predicting with another data set such as test.

As a simple example of this drawback, we simulate some toy data and fit a GLM:

set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))
toy <- glm(foo$y ~ foo$a, data = foo)  ## bad style

> toy$formula
foo$y ~ foo$a  
> toy$xlevels
$`foo$a`
[1] "a" "b" "c" "d"

Now we see that everything comes with the prefix foo$. During prediction:

newdata <- foo[1:2, ]  ## take first 2 rows of "foo" as "newdata"
rm(foo)  ## remove "foo" from R session
predict(toy, newdata)

we get an error:

Error in eval(expr, envir, enclos) : object 'foo' not found

The good style is to use bare variable names in the formula and let the data argument of the function supply the variables:

foo <- data.frame(y = y, a = factor(a))
toy <- glm(y ~ a, data = foo)

Then the foo$ prefix disappears.

> toy$formula
y ~ a
> toy$xlevels
$a
[1] "a" "b" "c" "d"
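
With this formula, prediction no longer depends on foo being present in the session. A minimal sketch that repeats the toy example with the good style; the name p is introduced here for illustration only:

```r
## Same toy data as above, fitted with bare variable names
set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))
toy <- glm(y ~ a, data = foo)

newdata <- foo[1:2, ]  ## take first 2 rows of "foo" as "newdata"
rm(foo)                ## remove "foo" from R session

## Variables are now looked up in "newdata", so this runs without error
p <- predict(toy, newdata)
```

Because newdata holds the first two rows of the training data, p should agree with the first two fitted values of toy.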

This explains two things:

  1. You complained to me in the comments that testreg$xlevels$manufacturerID gives NULL: the levels are stored under the name train$manufacturerID, not manufacturerID;
  2. The prediction error you posted

Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137

complains about train$manufacturerID rather than test$manufacturerID.
