randomForest does not work when training set has more different factor levels than test set

Question

When I test my trained model on new test data that has fewer factor levels than my training data, predict() returns the following error:

Type of predictors in new data do not match that of the training data.

My training data has a variable with 7 factor levels, and my test data has that same variable with 6 factor levels (all 6 ARE in the training data).

When I add an observation containing the "missing" 7th factor level to the test data, the model runs, so I'm not sure why this happens or what the logic behind it is.

I could understand randomForest choking if the test set had more or different factor levels, but why does it fail when the training set is the one with "more" levels?

Recommended answer

R expects both the training and the test data to have the exact same levels (even if one of the sets has no observations for a given level or levels). In your case, since the test dataset is missing a level that the training set has, you can do

test$val <- factor(test$val, levels=levels(train$val))

to make sure it has all the same levels and that they are coded the same way.
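To see what that line does, here is a minimal sketch on made-up data. The data frames `train` and `test`, the column `val`, and the response `y` are hypothetical stand-ins for the asker's data, and the randomForest call at the end only runs if the package happens to be installed. It assumes every level present in the test data also exists in the training data, as in the question; a test value outside the training levels would become NA.

```r
# Hypothetical toy data: `val` has 7 levels in train, but only 6 occur in test.
set.seed(1)
train <- data.frame(
  val = factor(rep(letters[1:7], times = 10)),
  y   = rnorm(70)
)
test <- data.frame(val = factor(rep(letters[1:6], times = 2)))

nlevels(train$val)  # 7
nlevels(test$val)   # 6

# Re-declare the test factor using the training levels. The observations
# keep their values; the unused level "g" is simply added with zero rows.
test$val <- factor(test$val, levels = levels(train$val))

identical(levels(test$val), levels(train$val))  # TRUE
table(test$val)                                 # "g" has a count of 0

# Optional: fit and predict, only if randomForest is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit  <- randomForest::randomForest(y ~ val, data = train)
  pred <- predict(fit, newdata = test)
}
```

Because a factor stores integer codes that index into its level vector, aligning the level vectors is also what makes the codes in both sets refer to the same categories.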

(reposted here to close out the question)
