How to use `predict()` without errors in a model when you have missing data?
Question
I have a pretty simple logistic regression model based solely on two categorical predictors, `Race` and `Sex`. Since I have some missing values, I import the data frame as follows to make sure all the missing data comes in as `NA`:
> mydata <- read.csv("~/Desktop/R/mydata.csv", sep=",", strip.white = TRUE,
+ na.strings= c("999", "NA", " ", ""))
Here's a summary of the predictors to see how many `NA`s there are:
> # Define variables
>
> Y <- cbind(Support)
> X <- cbind(Race, Sex)
>
> summary(X)
Race Sex
Min. :1.000000 Min. :1.000000
1st Qu.:1.000000 1st Qu.:1.000000
Median :2.000000 Median :1.000000
Mean :1.608696 Mean :1.318245
3rd Qu.:2.000000 3rd Qu.:2.000000
Max. :3.000000 Max. :3.000000
NA's :420 NA's :42
The model seems to do what it's supposed to, with no problems caused by the missing values:
> # Logit model coefficients
>
> logit <- glm(Y ~ X, family=binomial (link = "logit"))
>
> summary(logit)
Call:
glm(formula = Y ~ X, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0826825 -1.0911146 0.6473451 1.0190080 1.7457212
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3457629 0.2884629 4.66529 3.0818e-06 ***
XRace -1.0716191 0.1339177 -8.00207 1.2235e-15 ***
XSex 0.5910812 0.1420270 4.16175 3.1581e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1434.5361 on 1057 degrees of freedom
Residual deviance: 1347.5684 on 1055 degrees of freedom
(420 observations deleted due to missingness)
AIC: 1353.5684
Number of Fisher Scoring iterations: 4
Question 1: When I don't have any `NA`s, the code below works well, but I get an error message whenever there are missing values. Is there a way to still see how many values were predicted correctly, whether or not data is missing?
> table(true = Y, pred = round(fitted(logit)))
Error in table(true = Y, pred = round(fitted(logit))) :
all arguments must have the same length
After adding `na.action = na.exclude` to the model definition, the table now works perfectly:
pred
true 0 1
0 259 178
1 208 413
Something that does still work, regardless of missing data, is loading the predictions onto the original data frame with the code below. It correctly adds a `pred` column at the end of the data frame with each row's probability (and simply puts an `NA` there if one of the predictors is missing):
> predictions = cbind(mydata, pred = predict(logit, newdata = mydata, type = "response"))
> write.csv(predictions, "~/Desktop/R/predictions.csv", row.names = F)
Question 2: However, when I try to predict into a new data frame, even though it has the same variables of interest, something about the missing values seems to cause an error message as well. Is there code to get around this, or am I doing something incorrectly?
> newpredictions = cbind(newdata, pred = predict(logit, newdata = newdata, type = "response"))
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 1475, 1478
In addition: Warning message:
'newdata' had 1475 rows but variables found have 1478 rows
As you can see above, `mydata` has 1,478 rows and `newdata` has 1,475 rows.
Thanks for your help!
Recommended answer
If you have missing data (`NA`s), R strips those rows when the modelling function goes through the `formula` -> `model.frame` -> `model.matrix()` steps, because the default for all of these is `na.action = na.omit` (taken from `getOption("na.action")` unless you have changed it). In other words, rows with `NA`s are deleted before the actual computations are performed. This deletion propagates through to the fitted values, residuals, etc. that are accessed from the model object, which is why `fitted(logit)` is shorter than `Y` and `table()` complains about differing lengths.
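To make the row deletion concrete, here is a minimal sketch on simulated data. The `toy` data frame and its values are invented for illustration; only the column names mirror the question, and the predictors are assumed to be stored as factors.

```r
# Simulated stand-in data (made up; not the data from the question).
set.seed(1)
n <- 200
toy <- data.frame(
  Support = rbinom(n, 1, 0.5),
  Race    = factor(sample(1:3, n, replace = TRUE), levels = 1:3),
  Sex     = factor(sample(1:2, n, replace = TRUE), levels = 1:2)
)
toy$Race[1:20] <- NA               # introduce 20 missing predictor values

getOption("na.action")             # "na.omit" unless changed in your session

# No na.action given, so the session default is used.
fit_omit <- glm(Support ~ Race + Sex, data = toy, family = binomial)

nrow(toy)                          # rows that went into the call
length(fitted(fit_omit))           # only the complete rows come back
dropped <- as.integer(na.action(fit_omit))
head(dropped)                      # indices of the rows that were removed
```

`na.action(fit_omit)` is a handy way to check exactly which rows the model never saw.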
Because this is a known nuisance, R provides other `na.action` functions, including `na.exclude()`. Hence if you add
na.action = na.exclude
to your call to `glm()`, then `fitted()`, `resid()`, etc. will return as many values as you have rows in your input data, with `NA` in the positions of the rows that were dropped from the fit.
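A small sketch of the effect, again on simulated data (the `toy` data frame is hypothetical and the predictors are assumed to be factors): with `na.exclude` the fitted values line up with the original rows, so the confusion table from Question 1 can be built directly.

```r
# Simulated stand-in data (made up; not the data from the question).
set.seed(42)
n <- 300
toy <- data.frame(
  Support = rbinom(n, 1, 0.5),
  Race    = factor(sample(1:3, n, replace = TRUE), levels = 1:3),
  Sex     = factor(sample(1:2, n, replace = TRUE), levels = 1:2)
)
toy$Sex[sample(n, 30)] <- NA       # 30 rows with a missing predictor

fit_excl <- glm(Support ~ Race + Sex, data = toy, family = binomial,
                na.action = na.exclude)

# Fitted values are padded with NA, so lengths match the input data.
length(fitted(fit_excl)) == nrow(toy)
sum(is.na(fitted(fit_excl)))       # one NA per excluded row

# table() silently skips the NA pairs by default (useNA = "no").
conf_tab <- table(true = toy$Support, pred = round(fitted(fit_excl)))
conf_tab
```

The coefficients are the same as with `na.omit`; only the shape of what `fitted()` and `resid()` hand back changes.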
You do seem to be going about the modelling in a peculiar way. Why are you creating `X` and `Y`, presumably from your `mydata` object? It would be far better to do
mod <- glm(Support ~ Race + Sex, data = mydata, family = binomial,
           na.action = na.exclude)
where now, instead of the anonymous `X` and `Y`, we have variables that mean something, and you haven't had to create duplicate copies of the data.
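This also bears on Question 2. With `Y ~ X`, the formula refers to matrices in the workspace rather than to columns of a data frame, so `predict()` cannot find `X` in `newdata`, falls back to the original 1,478-row `X`, and issues the "variables found have 1478 rows" warning. Once the model is fitted on named columns, `predict()` looks them up in `newdata`. The following is a hedged sketch on simulated data; `make_toy()`, `train` and `new` are hypothetical names, and the predictors are assumed to be factors with the same levels in both data frames.

```r
# Hypothetical helper that builds made-up data with the question's column names.
make_toy <- function(n) {
  data.frame(
    Support = rbinom(n, 1, 0.5),
    Race    = factor(sample(1:3, n, replace = TRUE), levels = 1:3),
    Sex     = factor(sample(1:2, n, replace = TRUE), levels = 1:2)
  )
}

set.seed(7)
train <- make_toy(400)
train$Race[1:25] <- NA             # missing values in the training data
new <- make_toy(150)
new$Sex[1:10] <- NA                # missing values in the new data

mod <- glm(Support ~ Race + Sex, data = train, family = binomial,
           na.action = na.exclude)

# predict() passes NA rows through for newdata by default,
# so it returns one value per row of `new`.
new_pred <- cbind(new, pred = predict(mod, newdata = new, type = "response"))

nrow(new_pred) == nrow(new)
sum(is.na(new_pred$pred))          # NA where a predictor was missing
```

Rows of `new` with a missing predictor get `NA` in `pred`, matching the behaviour the question describes for the original data frame.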