predict.lm() with an unknown factor level in test data
Question
I am fitting a model to factor data and predicting. If the newdata argument of predict.lm() contains even a single factor level that is unknown to the model, the whole predict.lm() call fails and returns an error.
Is there a good way to have predict.lm() return a prediction for the factor levels the model knows and NA for unknown factor levels, instead of only an error?
Example code:
foo <- data.frame(response=rnorm(3),predictor=as.factor(c("A","B","C")))
model <- lm(response~predictor,foo)
foo.new <- data.frame(predictor=as.factor(c("A","B","C","D")))
predict(model,newdata=foo.new)
I would like the very last command to return three "real" predictions corresponding to factor levels "A", "B" and "C", and an NA corresponding to the unknown level "D".
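To make the failure concrete, here is a small sketch that reproduces the question's setup and captures the error instead of stopping. The set.seed() call and the variable name result are additions for illustration and are not part of the original question.

```r
# Reproduce the setup from the question (seed added only for repeatability)
set.seed(42)
foo <- data.frame(response = rnorm(3),
                  predictor = as.factor(c("A", "B", "C")))
model <- lm(response ~ predictor, foo)
foo.new <- data.frame(predictor = as.factor(c("A", "B", "C", "D")))

# Levels the fitted model knows about
model$xlevels

# Capture the error message rather than aborting the script
result <- tryCatch(
  predict(model, newdata = foo.new),
  error = function(e) conditionMessage(e)
)
print(result)
```

Because level "D" is not among the levels stored in model$xlevels, result ends up holding the error message as a character string rather than a vector of predictions.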
Recommended answer
I tidied and extended the function by MorgenBall. It is also implemented in sperrorest now.
- drops unused factor levels rather than just setting the missing values to NA
- issues a message to the user that factor levels have been dropped
- checks for the existence of factor variables in test_data and returns the original data.frame if none are present
- works not only for lm and glm but also for glmmPQL
Note: The function shown here may change (improve) over time.
#' @title remove_missing_levels
#' @description Accounts for missing factor levels present only in test data
#' but not in train data by setting values to NA
#'
#' @import magrittr
#' @importFrom gdata unmatrix
#' @importFrom purrr map
#' @importFrom stringr str_split
#'
#' @param fit fitted model on training data
#'
#' @param test_data data to make predictions for
#'
#' @return data.frame with matching factor levels to fitted model
#'
#' @keywords internal
#'
#' @export
remove_missing_levels <- function(fit, test_data) {
  # https://stackoverflow.com/a/39495480/4185785
  # drop empty factor levels in test data
  test_data %>%
    droplevels() %>%
    as.data.frame() -> test_data
  # 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
  # account for it
  if (inherits(fit, "glmmPQL")) {
    # Obtain factor predictors in the model and their levels
    # (the pattern also strips digits, '-' and '^' from the names)
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$contrasts))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }
    # map() is assumed to be purrr::map (see the @importFrom tag above)
    map(fit$contrasts, function(x) names(unmatrix(x))) %>%
      unlist() -> factor_levels
    factor_levels %>% str_split(":", simplify = TRUE) %>%
      extract(, 1) -> factor_levels
    model_factors <- as.data.frame(cbind(factors, factor_levels))
  } else {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$xlevels))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }
    factor_levels <- unname(unlist(fit$xlevels))
    model_factors <- as.data.frame(cbind(factors, factor_levels))
  }
  # Select column names in test data that are factor predictors in
  # trained model
  predictors <- names(test_data[names(test_data) %in% factors])
  # For each factor predictor in your data, if the level is not in the model,
  # set the value to NA
  # (seq_along() avoids iterating over 1:0 when no predictor matches)
  for (i in seq_along(predictors)) {
    found <- test_data[, predictors[i]] %in% model_factors[
      model_factors$factors == predictors[i], ]$factor_levels
    if (any(!found)) {
      # track which variable
      var <- predictors[i]
      # set to NA
      test_data[!found, predictors[i]] <- NA
      # drop empty factor levels in test data
      test_data %>%
        droplevels() -> test_data
      # issue message to console
      message(sprintf(paste0("Setting missing levels in '%s', only present",
                             " in test data but missing in train data,",
                             " to 'NA'."),
                      var))
    }
  }
  return(test_data)
}
We can apply this function to the example in the question as follows:
predict(model, newdata = remove_missing_levels(fit = model, test_data = foo.new))
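For readers who only need the lm case and want to avoid extra package dependencies, the same idea can be sketched in base R. The helper name na_unknown_levels is hypothetical and is not part of sperrorest. It assumes the factor appears in the formula under its plain column name (not wrapped in as.factor()), so that the names in fit$xlevels match the column names of the new data.

```r
set.seed(1)
foo <- data.frame(response = rnorm(3),
                  predictor = factor(c("A", "B", "C")))
model <- lm(response ~ predictor, foo)
foo.new <- data.frame(predictor = factor(c("A", "B", "C", "D")))

# Hypothetical helper: replace levels unknown to the model with NA.
# Only handles models that store their factor levels in fit$xlevels.
na_unknown_levels <- function(fit, newdata) {
  for (v in intersect(names(fit$xlevels), names(newdata))) {
    known <- fit$xlevels[[v]]
    x <- as.character(newdata[[v]])
    x[!(x %in% known)] <- NA
    # Rebuild the factor with exactly the levels seen during training
    newdata[[v]] <- factor(x, levels = known)
  }
  newdata
}

pred <- predict(model, newdata = na_unknown_levels(model, foo.new))
print(pred)
```

With the default na.action of predict.lm(), rows whose predictor is NA are kept and receive an NA prediction, so pred has four entries and the one for level "D" is NA.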
While trying to improve this function, I found that SL methods such as lm and glm need the same levels in the training and test data, whereas ML methods (svm, randomForest) fail if levels are removed. Those methods need all levels to be present in both the training and the test data.
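One common way to give both data sets an identical set of levels is to rebuild each factor from the union of the levels. This is a sketch under assumptions, not something taken from the answer: the data frames train and test and the column g are made up for illustration, and whether a particular ML method then accepts the data is not checked here.

```r
# Illustrative data: level "C" only in train, level "D" only in test
train <- data.frame(g = factor(c("A", "B", "C")))
test  <- data.frame(g = factor(c("A", "B", "D")))

# Give both factors the same, complete set of levels
all_levels <- union(levels(train$g), levels(test$g))
train$g <- factor(train$g, levels = all_levels)
test$g  <- factor(test$g,  levels = all_levels)

levels(train$g)
levels(test$g)
```

The values in each column are unchanged; only the declared level sets are extended so that they match.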
A general solution is quite hard to achieve, since each kind of fitted model stores its factor levels in a different component (fit$xlevels for lm and fit$contrasts for glmmPQL). At least it seems to be consistent across lm-related models.
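The consistency across lm-related models can be checked directly by comparing the xlevels component of an lm fit and a glm fit. The data frame d and the object names fit_lm and fit_glm below are invented for this sketch; glmmPQL is left out because it needs additional packages.

```r
# Small illustrative data set with a three-level factor
d <- data.frame(y = c(0, 1, 0, 1, 0, 1),
                g = factor(c("a", "b", "c", "a", "b", "c")))

fit_lm  <- lm(y ~ g, data = d)
fit_glm <- glm(y ~ g, family = binomial, data = d)

# Both fits store the training levels in the same place
fit_lm$xlevels
fit_glm$xlevels
same_levels <- identical(fit_lm$xlevels, fit_glm$xlevels)
print(same_levels)
```

Any helper that reads fit$xlevels should therefore work for both of these model classes, while other model types need their own lookup.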