predict.lm() with an unknown factor level in test data

Question

I am fitting a model to factor data and predicting. If the newdata passed to predict.lm() contains even a single factor level that is unknown to the model, the entire predict.lm() call fails and returns an error.

Is there a good way to have predict.lm() return a prediction for the factor levels the model knows and NA for unknown factor levels, instead of only an error?

Example code:

foo <- data.frame(response=rnorm(3),predictor=as.factor(c("A","B","C")))
model <- lm(response~predictor,foo)
foo.new <- data.frame(predictor=as.factor(c("A","B","C","D")))
predict(model,newdata=foo.new)

I would like the very last command to return three "real" predictions corresponding to factor levels "A", "B" and "C", and an NA corresponding to the unknown level "D".

Recommended answer

Tidied and extended the function by MorgenBall. It is also implemented in sperrorest now.

  • drops unused factor levels rather than just setting the missing values to NA
  • issues a message to the user that factor levels have been dropped
  • checks for the existence of factor variables in test_data and returns the original data.frame if none are present
  • works not only for lm and glm but also for glmmPQL

Note: The function shown here may change (improve) over time.

#' @title remove_missing_levels
#' @description Accounts for missing factor levels present only in test data
#' but not in train data by setting values to NA
#'
#' @import magrittr
#' @importFrom gdata unmatrix
#' @importFrom purrr map
#' @importFrom stringr str_split
#'
#' @param fit fitted model on training data
#'
#' @param test_data data to make predictions for
#'
#' @return data.frame with matching factor levels to fitted model
#'
#' @keywords internal
#'
#' @export
remove_missing_levels <- function(fit, test_data) {

  # https://stackoverflow.com/a/39495480/4185785

  # drop empty factor levels in test data
  test_data %>%
    droplevels() %>%
    as.data.frame() -> test_data

  # 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
  # account for it
  if (any(class(fit) == "glmmPQL")) {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$contrasts))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    map(fit$contrasts, function(x) names(unmatrix(x))) %>%
      unlist() -> factor_levels
    factor_levels %>% str_split(":", simplify = TRUE) %>%
      extract(, 1) -> factor_levels

    model_factors <- as.data.frame(cbind(factors, factor_levels))
  } else {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$xlevels))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    factor_levels <- unname(unlist(fit$xlevels))
    model_factors <- as.data.frame(cbind(factors, factor_levels))
  }

  # Select column names in test data that are factor predictors in
  # trained model

  predictors <- names(test_data[names(test_data) %in% factors])

  # For each factor predictor in your data, if the level is not in the model,
  # set the value to NA

  # seq_along() avoids iterating over 1:0 when no factor predictors match
  for (i in seq_along(predictors)) {
    found <- test_data[, predictors[i]] %in% model_factors[
      model_factors$factors == predictors[i], ]$factor_levels
    if (any(!found)) {
      # track which variable
      var <- predictors[i]
      # set to NA
      test_data[!found, predictors[i]] <- NA
      # drop empty factor levels in test data
      test_data %>%
        droplevels() -> test_data
      # issue message to console
      message(sprintf(paste0("Setting missing levels in '%s', only present",
                             " in test data but missing in train data,",
                             " to 'NA'."),
                      var))
    }
  }
  return(test_data)
}

We can apply this function to the example in the question as follows:

predict(model, newdata = remove_missing_levels(fit = model, test_data = foo.new))
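
For plain lm fits, the core idea can also be sketched in base R without the extra packages. This is only a minimal sketch, not the sperrorest implementation: the helper name na_unknown_levels is hypothetical, and it assumes that the factor columns in the new data have the same names as the entries of fit$xlevels (that is, factors were not wrapped in as.factor() inside the formula).

```r
# Hypothetical helper: set factor levels unknown to the fit to NA.
# Assumes names(fit$xlevels) match column names in new_data.
na_unknown_levels <- function(fit, new_data) {
  for (var in intersect(names(fit$xlevels), names(new_data))) {
    known <- fit$xlevels[[var]]
    # factor() maps any value outside `known` to NA
    new_data[[var]] <- factor(as.character(new_data[[var]]), levels = known)
  }
  new_data
}

set.seed(1)
foo <- data.frame(response = rnorm(3),
                  predictor = as.factor(c("A", "B", "C")))
model <- lm(response ~ predictor, foo)
foo.new <- data.frame(predictor = as.factor(c("A", "B", "C", "D")))

# predict.lm() uses na.action = na.pass by default, so the row with the
# unknown level "D" yields NA instead of an error
pred <- predict(model, newdata = na_unknown_levels(model, foo.new))
pred
```

The first three elements of pred are ordinary predictions and the fourth is NA, which is the behaviour asked for in the question.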

While trying to improve this function, I found that SL methods such as lm and glm need the same levels in train and test data, whereas ML methods (svm, randomForest) fail if levels are removed: they need all levels to be present in both train and test data.
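
One way to keep every training level present in the test data, as those methods require, is to rebuild the test factor on the training levels. The following is a minimal base-R sketch with made-up data. Values not seen in training become NA and still have to be handled (for example by dropping or imputing those rows) before predicting with such models.

```r
# Made-up train and test data for illustration only
train <- data.frame(x = factor(c("A", "B", "C")))
test  <- data.frame(x = factor(c("A", "D")))

# Re-create the test factor on the training levels: all training levels are
# kept (even those absent from the test data) and unseen values become NA
test$x <- factor(as.character(test$x), levels = levels(train$x))

levels(test$x)
is.na(test$x)
```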

A general solution is quite hard to achieve because each fitted model stores its factor levels in a different component (fit$xlevels for lm, fit$contrasts for glmmPQL). At least the storage seems to be consistent across lm-related models.
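
The consistency across lm-related models can be checked directly. A small sketch with made-up data (the variable names are illustrative only) that fits an lm and a glm on the same factor and compares their xlevels components:

```r
set.seed(1)
d <- data.frame(y = rnorm(6),
                count = c(1, 3, 2, 5, 4, 6),
                g = factor(rep(c("A", "B", "C"), 2)))

lm_fit  <- lm(y ~ g, data = d)
glm_fit <- glm(count ~ g, family = poisson, data = d)

# Both fits store the factor levels seen during fitting in `xlevels`
lm_fit$xlevels
glm_fit$xlevels
```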
