How to use `predict()` without errors in a model when you have missing data?


Question

I have a pretty simple logistic regression model based solely on two categorical predictors, Race and Sex. Since I have some missing values, I import the data frame as follows to make sure all the missing data comes in as NA:

> mydata <- read.csv("~/Desktop/R/mydata.csv", sep=",", strip.white = TRUE,
+                    na.strings= c("999", "NA", " ", ""))

Here's the summary of the predictors, showing how many NAs there are:

> # Define variables 
> 
> Y <- cbind(Support)
> X <- cbind(Race, Sex)
>
> summary(X) 
      Race               Sex          
 Min.   :1.000000   Min.   :1.000000  
 1st Qu.:1.000000   1st Qu.:1.000000  
 Median :2.000000   Median :1.000000  
 Mean   :1.608696   Mean   :1.318245  
 3rd Qu.:2.000000   3rd Qu.:2.000000  
 Max.   :3.000000   Max.   :3.000000  
 NA's   :420        NA's   :42 

The model seems to fit as it is supposed to, with no problems caused by the missing values:

> # Logit model coefficients 
> 
> logit <- glm(Y ~ X, family=binomial (link = "logit")) 
> 
> summary(logit) 

Call:
glm(formula = Y ~ X, family = binomial(link = "logit"))

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-2.0826825  -1.0911146   0.6473451   1.0190080   1.7457212  

Coefficients:
              Estimate Std. Error  z value   Pr(>|z|)    
(Intercept)  1.3457629  0.2884629  4.66529 3.0818e-06 ***
XRace       -1.0716191  0.1339177 -8.00207 1.2235e-15 ***
XSex         0.5910812  0.1420270  4.16175 3.1581e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1434.5361  on 1057  degrees of freedom
Residual deviance: 1347.5684  on 1055  degrees of freedom
  (420 observations deleted due to missingness)
AIC: 1353.5684

Number of Fisher Scoring iterations: 4

Question 1: When I don't have any NAs, the code below works well, but I get an error message whenever there are missing values. Is there a way to see how many values were predicted correctly, whether or not data is missing?

> table(true = Y, pred = round(fitted(logit))) 
Error in table(true = Y, pred = round(fitted(logit))) : 
all arguments must have the same length

After adding na.action = na.exclude to the model definition, the table now works perfectly:

      pred
true     0    1
   0   259  178
   1   208  413

Something that does still work, regardless of missing data, is loading the predictions onto the original data frame with the code below. It correctly adds a 'pred' column at the end of the data frame with each row's probability (and simply puts NA there if one of the predictors is missing):

> predictions = cbind(mydata, pred = predict(logit, newdata = mydata, type = "response"))
> write.csv(predictions, "~/Desktop/R/predictions.csv", row.names = F)

Question 2: However, when I try to predict into a new data frame, even though it has the same variables of interest, something about the missing values seems to cause an error message as well. Is there code to get around this, or am I doing something incorrectly?

> newpredictions = cbind(newdata, pred = predict(logit, newdata = newdata, type = "response"))
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 1475, 1478
In addition: Warning message:
'newdata' had 1475 rows but variables found have 1478 rows 

As you can see above, the number of rows in mydata is 1,478 and the number of rows in newdata is 1,475.

Thanks for your help!

Recommended answer

If you have missing data (NAs), R will strip these rows when the modelling functions go through formula -> model.frame() -> model.matrix() etc., because the default in all these functions is na.action = na.omit. In other words, rows with NAs are deleted before the actual computations are performed. This deletion propagates through to the fitted values, residuals, etc. that are accessed from the model object.
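As a small illustration of the default behaviour described above, the sketch below fits a logistic regression to simulated data. The data frame `dat` and its values are made up for this example and are not the asker's data; only the variable names Support, Race and Sex are borrowed from the question.

```r
# Simulated stand-in for the asker's data (hypothetical values)
set.seed(1)
n <- 200
dat <- data.frame(
  Support = rbinom(n, 1, 0.5),
  Race    = factor(sample(1:3, n, replace = TRUE)),
  Sex     = factor(sample(1:2, n, replace = TRUE))
)
dat$Race[1:20] <- NA   # make 20 rows incomplete

# Default na.action (na.omit): incomplete rows are dropped before fitting
fit_omit <- glm(Support ~ Race + Sex, data = dat, family = binomial)

nrow(dat)                  # 200 rows in the data
nobs(fit_omit)             # 180 rows actually used in the fit
length(fitted(fit_omit))   # 180: no fitted values for the dropped rows
head(na.action(fit_omit))  # which rows were dropped
```

Because `fitted(fit_omit)` is shorter than the response column, passing both to `table()` fails with the "all arguments must have the same length" error shown in the question.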

Because this is a known issue, R provides other na.action functions, including na.exclude(). Hence if you add

na.action = na.exclude

to your call to glm(), then fitted(), resid(), etc. will return as many values as you have rows in your input data, with NA in the positions of the rows that were excluded.
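A minimal sketch of the effect of na.exclude, again on simulated data (the data frame `dat` is hypothetical, not the asker's): the fitted values are padded with NA, so they line up with the response and the confusion table can be built.

```r
# Simulated stand-in for the asker's data (hypothetical values)
set.seed(2)
n <- 200
dat <- data.frame(
  Support = rbinom(n, 1, 0.5),
  Race    = factor(sample(1:3, n, replace = TRUE)),
  Sex     = factor(sample(1:2, n, replace = TRUE))
)
dat$Sex[sample(n, 15)] <- NA   # 15 rows with a missing predictor

fit_excl <- glm(Support ~ Race + Sex, data = dat, family = binomial,
                na.action = na.exclude)

length(fitted(fit_excl)) == nrow(dat)  # TRUE: one value per input row
sum(is.na(fitted(fit_excl)))           # 15: NA where a predictor was missing

# Lengths now match; table() leaves out the NA pairs by default
tab <- table(true = dat$Support, pred = round(fitted(fit_excl)))
sum(tab)                               # 185 complete cases are tabulated
```

The table only counts rows with a prediction; the rows with missing predictors are silently left out rather than causing an error.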

You do seem to be going about modelling in a peculiar way. Why are you creating X and Y, presumably from your mydata object? It would be far better to do

mod <- glm(Support ~ Race + Sex, data = mydata, family = binomial,
           na.action = na.exclude)

where, instead of the anonymous X and Y, we now have variables that mean something, and you haven't had to create duplicate data.
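Fitting with named variables also matters for the second question. When the formula is Y ~ X, with X built by cbind() in the workspace, predict() cannot find X in newdata and falls back to the workspace copy, which is what the "'newdata' had 1475 rows but variables found have 1478 rows" warning reports. The sketch below, on simulated data, shows prediction into a new data frame once the model is written in terms of named columns. The helper `make_data()` and the data frames `train` and `new` are hypothetical stand-ins for mydata and newdata.

```r
# Hypothetical helper that simulates data with the question's column names
set.seed(3)
make_data <- function(n) {
  data.frame(
    Support = rbinom(n, 1, 0.5),
    Race    = factor(sample(1:3, n, replace = TRUE), levels = 1:3),
    Sex     = factor(sample(1:2, n, replace = TRUE), levels = 1:2)
  )
}
train <- make_data(300)   # stands in for mydata
new   <- make_data(120)   # stands in for newdata
new$Race[1:5] <- NA       # some new rows have a missing predictor

mod <- glm(Support ~ Race + Sex, data = train, family = binomial,
           na.action = na.exclude)

# Variables are looked up by name in newdata, so row counts always agree
pred <- predict(mod, newdata = new, type = "response")
length(pred) == nrow(new)   # TRUE
sum(is.na(pred))            # 5: NA where a predictor was missing

new_with_pred <- cbind(new, pred = pred)
```

With this form the cbind() step from the question should work for any new data frame that contains Race and Sex columns, whatever its number of rows.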
