Predict.lm in R fails to recognize newdata

Question

I'm running a linear regression where the predictor is categorized by another value and am having trouble generating modeled responses for newdata.

First, I generate some random values for the predictor and the error terms. I then construct the response. Note that the predictor's coefficient depends on the value of a categorical variable. I compose a design matrix based on the predictor and its category.

set.seed(1)

category = c(rep("red", 5), rep("blue",5))
x1 = rnorm(10, mean = 1, sd = 1)
err = rnorm(10, mean = 0, sd = 1)

y = ifelse(category == "red", x1 * 2, x1 * 3)
y = y + err

df = data.frame(x1 = x1, category = category)

dm = as.data.frame(model.matrix(~ category + 0, data = df))
dm = dm * df$x1

fit = lm(y ~ as.matrix(dm) + 0, data = df)

# This line will not produce a warning
predictOne = predict.lm(fit, newdata = dm)

# This line WILL produce a warning
predictTwo = predict.lm(fit, newdata = dm[1:5,])

The warning is:

'newdata' had 5 rows but variable(s) found have 10 rows

Unless I'm very much mistaken, I shouldn't have any issues with the variable names. (There are one or two discussions on this board which suggest that issue.) Note that the first prediction runs fine, but the second does not. The only change is that the second prediction uses only the first five rows of the design matrix.

Any ideas?

Recommended answer

I'm not 100% sure what you're trying to do, but I think a short walk-through of how formulas work will clear things up for you.

The basic idea is very simple: you pass two things, a formula and a data frame. The terms in the formula should all be names of variables in your data frame.

Now, you can get lm to work without following that guideline exactly, but you're just asking for things to go wrong. So stop and look at your model specifications and think about where R is looking for things.

When you call lm, basically none of the names in your formula are actually found in the data frame df, so I suspect that df isn't being used at all.
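A quick way to check this is to compare the variables the formula refers to against the columns of df. The sketch below rebuilds the question's objects; formula_vars and found_in_df are illustrative names that do not appear in the original post.

```r
# Rebuild the objects from the question
set.seed(1)
category <- c(rep("red", 5), rep("blue", 5))
x1 <- rnorm(10, mean = 1, sd = 1)
err <- rnorm(10, mean = 0, sd = 1)
y <- ifelse(category == "red", x1 * 2, x1 * 3) + err

df <- data.frame(x1 = x1, category = category)
dm <- as.data.frame(model.matrix(~ category + 0, data = df)) * df$x1
fit <- lm(y ~ as.matrix(dm) + 0, data = df)

# Variables the formula actually refers to (illustrative names)
formula_vars <- all.vars(formula(fit))
# Which of them are columns of df?
found_in_df <- formula_vars %in% names(df)

print(formula_vars)
print(found_in_df)
```

Neither y nor dm is a column of df, so lm has to look both up in the environment where the formula was created rather than in the data argument.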

Then if you call model.frame(fit) you'll see what R thinks your variables should be called. Notice anything strange?

model.frame(fit)
            y as.matrix(dm).categoryblue as.matrix(dm).categoryred
1   2.2588735                  0.0000000                 0.3735462
2   2.7571299                  0.0000000                 1.1836433
3  -0.2924978                  0.0000000                 0.1643714
4   2.9758617                  0.0000000                 2.5952808
5   3.7839465                  0.0000000                 1.3295078
6   0.4936612                  0.1795316                 0.0000000
7   4.4460969                  1.4874291                 0.0000000
8   6.1588103                  1.7383247                 0.0000000
9   5.5485653                  1.5757814                 0.0000000
10  2.6777362                  0.6946116                 0.0000000

Is there anything called as.matrix(dm).categoryblue in dm? Yeah, I didn't think so.
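The practical consequence can be sketched as follows: since the term as.matrix(dm) cannot be resolved from the columns of newdata, predict appears to fall back on the full dm in the calling environment, so newdata is effectively ignored. The names pred_all and pred_sub are illustrative, and the warning is suppressed here only to keep the output clean.

```r
# Rebuild the objects from the question
set.seed(1)
category <- c(rep("red", 5), rep("blue", 5))
x1 <- rnorm(10, mean = 1, sd = 1)
err <- rnorm(10, mean = 0, sd = 1)
y <- ifelse(category == "red", x1 * 2, x1 * 3) + err

df <- data.frame(x1 = x1, category = category)
dm <- as.data.frame(model.matrix(~ category + 0, data = df)) * df$x1
fit <- lm(y ~ as.matrix(dm) + 0, data = df)

# "Predicting" on the full design matrix just reproduces the fitted values
pred_all <- predict(fit, newdata = dm)
same_as_fitted <- isTRUE(all.equal(unname(pred_all), unname(fitted(fit))))

# Asking for 5 rows still returns one value per row of the original dm
pred_sub <- suppressWarnings(predict(fit, newdata = dm[1:5, ]))

print(same_as_fitted)
print(length(pred_sub))
```

So the first call in the question only looked as if it worked: the row counts happened to match, which is why no warning was raised.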

I suspect (but am not sure) that you meant to do something more like this:

df$y <- y
fit <- lm(y~category - 1,data = df)
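If the goal is instead what the question's design matrix encodes, namely a separate slope on x1 for each category with no intercept, one possible formulation is an interaction term. This is a sketch under that assumption, not necessarily what the asker intended; fit_slopes, pred_first5 and new_obs are illustrative names.

```r
# Rebuild the data from the question, keeping y inside the data frame
set.seed(1)
category <- c(rep("red", 5), rep("blue", 5))
x1 <- rnorm(10, mean = 1, sd = 1)
err <- rnorm(10, mean = 0, sd = 1)
y <- ifelse(category == "red", x1 * 2, x1 * 3) + err
df <- data.frame(y = y, x1 = x1, category = category)

# One slope per category, no intercept; every name in the formula is a column of df
fit_slopes <- lm(y ~ x1:category - 1, data = df)
print(coef(fit_slopes))

# newdata is now honoured: 5 rows in, 5 predictions out
pred_first5 <- predict(fit_slopes, newdata = df[1:5, ])
print(length(pred_first5))

# Predictions for genuinely new observations (hypothetical values)
new_obs <- data.frame(x1 = c(0.5, 2), category = c("red", "blue"))
print(predict(fit_slopes, newdata = new_obs))
```

Because the formula only mentions columns of the data frame, predict can rebuild the design matrix from whatever newdata it is given, as long as newdata has columns named x1 and category.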
