randomForest does not work when training set has more different factor levels than test set
Question
When I try to test my trained model on new test data that has fewer factor levels than my training data, predict() returns the following:
Type of predictors in new data do not match that of the training data.
My training data has a variable with 7 factor levels, and my test data has that same variable with 6 factor levels (all 6 of which are present in the training data).
When I add an observation containing the "missing" 7th level to the test data, the model runs, so I'm not sure why this happens or what the logic behind it is.
I could understand randomForest choking if the test set had more or different factor levels, but why does it fail when the training set is the one with "more" data?
Recommended answer
R expects the training and test data to have exactly the same factor levels (even if one of the sets has no observations for a given level or levels). In your case, since the test dataset is missing a level that the training set has, you can do
test$val <- factor(test$val, levels=levels(train$val))
to make sure it has all the same levels and they are coded the same way.
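To see what the re-leveling changes, here is a minimal base-R sketch. The data frames and their values are made up for illustration (only the names train, test, and val come from the answer above), and randomForest itself is not called; the point is only to show how the level sets and the underlying integer codes differ before the fix and agree after it.

```r
# Hypothetical toy data: `val` has 7 levels in train, but only 6 occur in test.
train <- data.frame(val = factor(c("a", "b", "c", "d", "e", "f", "g")))
test  <- data.frame(val = factor(c("a", "b", "c", "d", "e", "g")))

nlevels(train$val)  # 7
nlevels(test$val)   # 6

# The same label is stored under different integer codes in the two factors.
as.integer(train$val[train$val == "g"])  # 7
as.integer(test$val[test$val == "g"])    # 6

# Rebuild the test factor using the training levels, as in the answer.
test$val <- factor(test$val, levels = levels(train$val))

identical(levels(test$val), levels(train$val))  # TRUE
as.integer(test$val[test$val == "g"])           # 7

# The unused level "f" is now present with a count of zero.
table(test$val)
```

One caveat with this approach: any value in the test data that is not among the training levels becomes NA after the factor() call, so it is worth checking anyNA(test$val) afterwards.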
(reposted here to close out the question)