R - factor examcard has new levels

Problem description

I built a classification model in R using C5.0, as shown below:

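library(C50)
library(caret)

# Load the data and fix the seed so the split is reproducible
a = read.csv("All_SRN.csv")
set.seed(123)

# 70/30 split, stratified on the target class (anatomy)
inTrain <- createDataPartition(a$anatomy, p = .70, list = FALSE)
training <- a[ inTrain,]
test <- a[-inTrain,]

# Fit the C5.0 tree; note that trControl/trainControl belong to caret::train(), not to C5.0()
Tree <- C5.0(anatomy ~ ., data = training, 
            trControl = trainControl(method = "repeatedcv", repeats = 10,
                                     classProb = TRUE))

# Predict on the held-out test set
TreePred <- predict(Tree, test)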

The training set has features such as examcard, coil_used, anatomy_region, bodypart_anatomy, and anatomy (the target class). All the features are categorical variables. There are a little over 10k values in total, which I divided into training and test data. The learner worked well with this training and test set partitioned in a 70:30 ratio, but the problem appears when I give it a test set containing new values, as below:

TreePred <- predict(Tree, test_add)

Here, test_add contains the existing test set plus a set of new values. On execution, the learner fails to classify the new values and throws the following error:

Error in model.frame.default(object$Terms, newdata, na.action = na.action, : factor examcard has new levels

I tried to merge the new factor levels with the existing ones using:

Tree$xlevels[["examcard"]] <- union(Tree$xlevels[["examcard"]], levels(test_add$examcard))

But this didn't help much: the code ran with the following message and produced no useful result:

predict code called exit with value 1

The feature examcard carries a great deal of weight in the classification and therefore can't be ignored. How can this set of values be classified?

Recommended answer

You cannot create a prediction for factor levels in your test set that are absent from your training set. The model has learned nothing about those levels: it has no coefficients (or, for a tree model such as C5.0, no split branches) for them.
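As a rough illustration of what that means in practice, the sketch below finds the levels that are new and separates the rows the model could score from those it could not. It uses base R only; the toy data frames and the names new_levels, is_known, scorable and unscorable are made up for this example and are not part of the original question.

```r
# Toy stand-ins for the training data and the new data (hypothetical values).
training <- data.frame(
  examcard = factor(c("A", "B", "A", "C")),
  anatomy  = factor(c("knee", "spine", "knee", "brain"))
)
test_add <- data.frame(
  examcard = factor(c("A", "D", "C", "E"))
)

# Levels that occur in the new data but were never seen in training.
new_levels <- setdiff(levels(test_add$examcard), levels(training$examcard))

# Rows the model has seen levels for versus rows it has no basis to score.
is_known   <- test_add$examcard %in% levels(training$examcard)
scorable   <- test_add[is_known, , drop = FALSE]
unscorable <- test_add[!is_known, , drop = FALSE]

# Re-encode with the training levels so the factor coding matches the model's.
scorable$examcard <- factor(scorable$examcard,
                            levels = levels(training$examcard))

print(new_levels)
```

Only the scorable rows could then be passed to predict(); the unscorable rows need either retraining with data that contains those examcard values or some separate fallback rule.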

If you are doing a 70/30 split, you need to repartition your data, either with caret::createDataPartition or with your own stratified sampling function, so that all levels are represented in the training set. One way is the "split-apply-combine" approach: split the data set by examcard, apply the 70/30 split within each subset, then combine the training subsets and the testing subsets.
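A minimal base R sketch of that split-apply-combine idea follows. The toy data frame stands in for All_SRN.csv, and the rule of keeping at least one row per level in training is an assumption added here so that rare levels are never left out; adjust it to taste.

```r
set.seed(123)

# Toy data standing in for All_SRN.csv (hypothetical values).
a <- data.frame(
  examcard = factor(rep(c("A", "B", "C"), times = c(10, 5, 1))),
  anatomy  = factor(rep(c("knee", "spine"), length.out = 16))
)

# Split: row indices grouped by examcard level (unused levels are dropped).
rows_by_level <- split(seq_len(nrow(a)), a$examcard, drop = TRUE)

# Apply: within each level, draw about 70% of the rows for training,
# but always at least one so that no level is missing from training.
train_idx <- unlist(lapply(rows_by_level, function(idx) {
  n_train <- max(1, floor(0.7 * length(idx)))
  idx[sample.int(length(idx), n_train)]
}), use.names = FALSE)

# Combine: the per-level picks form the training set, the rest the test set.
training <- a[train_idx, ]
test     <- a[-train_idx, ]

# Every examcard value in the test set now also occurs in the training set.
print(all(test$examcard %in% training$examcard))
```

A level with a single row ends up in training only, so it cannot be evaluated on the test set, but it can no longer trigger the "new levels" error for data drawn from the same file. Values that appear only in genuinely new data, as in test_add, still require retraining.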

See this question for more details.
