Different caret/train errors when using OOB and k-fold cross-validation with random forest
Question
Here is the code I am using:
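# data set for debugging in RStudio
library(caret)        # createDataPartition(), upSample(), trainControl(), train()
library(randomForest) # rfImpute() and the imports85 data set
data("imports85")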
input<-imports85
# settings
set.seed(1)
dependent <- make.names("make")
training.share <- 0.75
impute <- "yes"
type <- "class" # either "class" or "regr" from SF doc prop
# split off rows w/o label and then split into test/train using stratified sampling
input.labelled <- input[complete.cases(input[,dependent]),]
train.index <- createDataPartition(input.labelled[,dependent], p=training.share, list=FALSE)
rf.train <- input.labelled[train.index,]
rf.test <- input.labelled[-train.index,]
# create cleaned train data set w/ or w/o imputation
if (impute=="no") {
rf.train.clean <- rf.train[complete.cases(rf.train),] #drop cases w/ missing variables
} else if (impute=="yes") {
rf.train.clean <- rfImpute(rf.train[,dependent] ~ .,rf.train)[,-1] #impute missing variables and remove added duplicate of dependent column
}
# define variables Y and dependent x
Y <- rf.train.clean[, names(rf.train.clean) == dependent]
x <- rf.train.clean[, names(rf.train.clean) != dependent]
# upsample minority classes (classification only)
if (type=="class") {
rf.train.upsampled <- upSample(x=x, y=Y)
}
# train and tune RF model
cntrl<-trainControl(method = "oob", number=5, p=0.9, sampling = "up", search='grid') # oob error to tune model
tunegrid <- expand.grid(.mtry = (1:5)) # tuning grid with mtry values 1 to 5
rf <- train(x, Y, method="rf", metric="Accuracy", trControl=cntrl, tuneGrid=tunegrid)
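The first error is related to this one, but I am using caret and randomForest rather than lars, so I don't understand it...
Error in order(x[, 1]) : 'x' must be atomic for 'sort.list' Have you called 'sort' on a list?
No, I did not call 'sort' on a list... at least not that I am aware of ;-)
I checked the caret/train documentation, and it says x should be a data frame, which it is according to str(x).
If I use k-fold cross-validation instead of the OOB error, like this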
cntrl<-trainControl(method = "repeatedcv", number=5, repeats = 2, p=0.9, sampling = "up", search='grid')
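there is another interesting error: Can't have empty classes in y.
Checking complete.cases(Y) seems to indicate that there are no empty classes...
Does anyone have a hint for me?
Thanks, Mark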
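Recommended answer
It's because of your dependent variable. You chose make. Did you inspect this field? You have a training set and a test set; where do you put a class that has only one observation, such as make = "mercury"? How can you train with that? And if the model was never trained on it, how can you test for it?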
library(dplyr) # provides %>%, group_by(), summarise(), arrange()
input %>%
group_by(make) %>%
summarise(count = n()) %>%
arrange(count) %>%
print(n = 22)
# # A tibble: 22 × 2
# make count
# <fct> <int>
# 1 mercury 1
# 2 renault 2
# 3 alfa-romero 3
# 4 chevrolet 3
# 5 jaguar 3
# 6 isuzu 4
# 7 porsche 5
# 8 saab 6
# 9 audi 7
# 10 plymouth 7
# 11 bmw 8
# 12 mercedes-benz 8
# 13 dodge 9
# 14 peugot 11
# 15 volvo 11
# 16 subaru 12
# 17 volkswagen 12
# 18 honda 13
# 19 mitsubishi 13
# 20 mazda 17
# 21 nissan 18
# 22 toyota 32
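You also got a warning when you ran createDataPartition(). I think the randomForest package requires at least five observations per group. You can filter for the groups you want to include and use that data for both testing and training.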
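One likely reason complete.cases(Y) looks clean while the error still mentions empty classes: complete.cases() only checks for NA values, whereas a factor keeps all of its levels after subsetting, including levels that no longer occur. The following is a minimal base-R sketch on made-up toy data (not the imports85 data), so the variable names are illustrative only:

```r
# Toy data (made up for illustration): one class has a single observation.
make <- factor(c("toyota", "toyota", "toyota", "nissan", "nissan", "mercury"))

# Suppose the split leaves the lone "mercury" row out of the training set.
y_train <- make[1:5]

# complete.cases() only looks for NA values, so it reports no problem.
all(complete.cases(y_train))

# The factor still carries the unused level, which shows up as a zero count.
table(y_train)
empty_levels <- names(which(table(y_train) == 0))
empty_levels

# droplevels() removes levels that no longer occur in the data.
y_fixed <- droplevels(y_train)
levels(y_fixed)
```

Running table(Y) on the real training labels is a quick way to spot such zero-count classes before calling train().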
Before the comment labeled settings, you can add the following to subset the groups and verify the result.
filtGrps <- input %>%
group_by(make) %>%
summarise(count = n()) %>%
filter(count >=5) %>%
select(make) %>%
unlist()
# filter for groups with sufficient observations for package
input <- input %>%
filter(make %in% filtGrps) %>%
droplevels() # then drop the empty levels
# check to see if it filtered as expected
input %>%
group_by(make) %>%
summarise(count = n()) %>%
arrange(-count) %>%
print(n = 16)
This uses a cutoff of only 5 observations per group, which is not ideal. (The more, the better.)
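Since the cutoff is a judgment call, it can help to make it a parameter. Below is a minimal base-R sketch of the same filtering idea. The helper name filter_min_class_size and the toy data frame are hypothetical, introduced here only for illustration:

```r
# Hypothetical helper: keep only rows whose class has at least `min_n` rows,
# then drop the factor levels that became empty.
filter_min_class_size <- function(data, class_col, min_n = 5) {
  counts <- table(data[[class_col]])
  keep <- names(counts)[counts >= min_n]
  droplevels(data[data[[class_col]] %in% keep, , drop = FALSE])
}

# Toy data (made up): class "a" has 6 rows, "b" has 5, "c" has 2.
toy <- data.frame(
  make = factor(rep(c("a", "b", "c"), times = c(6, 5, 2))),
  price = seq_len(13)
)

filtered <- filter_min_class_size(toy, "make", min_n = 5)
table(filtered$make)

# A stricter cutoff keeps fewer classes.
stricter <- filter_min_class_size(toy, "make", min_n = 6)
levels(stricter$make)
```

Applied to the question's data, the call would look like filter_min_class_size(input, "make", min_n = 5), with a larger min_n if enough data remains.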
With this filter in place, all of your code runs.
rf
# Random Forest
#
# 147 samples
# 25 predictor
# 16 classes: 'audi', 'bmw', 'dodge', 'honda', 'mazda', 'mercedes-benz', 'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo'
#
# No pre-processing
# Addtional sampling using up-sampling
#
# Resampling results across tuning parameters:
#
# mtry Accuracy Kappa
# 1 0.9505208 0.9472222
# 2 0.9869792 0.9861111
# 3 0.9869792 0.9861111
# 4 0.9895833 0.9888889
# 5 0.9921875 0.9916667
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 5.
rf$finalModel
#
# Call:
# randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 5
#
# OOB estimate of error rate: 0.52%
# Confusion matrix:
# audi bmw dodge honda mazda mercedes-benz mitsubishi nissan peugot
# audi 24 0 0 0 0 0 0 0 0
# bmw 0 24 0 0 0 0 0 0 0
# dodge 0 0 24 0 0 0 0 0 0
# honda 0 0 0 24 0 0 0 0 0
# mazda 0 0 0 0 24 0 0 0 0
# mercedes-benz 0 0 0 0 0 24 0 0 0
# mitsubishi 0 0 0 0 0 0 24 0 0
# nissan 0 0 0 0 0 0 0 24 0
# peugot 0 0 0 0 0 0 0 0 24
# plymouth 0 0 0 0 0 0 0 0 0
# porsche 0 0 0 0 0 0 0 0 0
# saab 0 0 0 0 0 0 0 0 0
# subaru 0 0 0 0 0 0 0 0 0
# toyota 0 0 0 0 0 0 0 1 0
# volkswagen 0 0 0 0 0 0 0 0 0
# volvo 0 0 0 0 0 0 0 0 0
# plymouth porsche saab subaru toyota volkswagen volvo class.error
# audi 0 0 0 0 0 0 0 0.00000000
# bmw 0 0 0 0 0 0 0 0.00000000
# dodge 0 0 0 0 0 0 0 0.00000000
# honda 0 0 0 0 0 0 0 0.00000000
# mazda 0 0 0 0 0 0 0 0.00000000
# mercedes-benz 0 0 0 0 0 0 0 0.00000000
# mitsubishi 0 0 0 0 0 0 0 0.00000000
# nissan 0 0 0 0 0 0 0 0.00000000
# peugot 0 0 0 0 0 0 0 0.00000000
# plymouth 24 0 0 0 0 0 0 0.00000000
# porsche 0 24 0 0 0 0 0 0.00000000
# saab 0 0 24 0 0 0 0 0.00000000
# subaru 0 0 0 24 0 0 0 0.00000000
# toyota 0 0 0 0 22 0 1 0.08333333
# volkswagen 0 0 0 0 0 24 0 0.00000000
# volvo 0 0 0 0 0 0 24 0.00000000
Of course, you still need to test this model.
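As a rough sketch of what that test could look like, the base-R helper below compares hold-out predictions with the true labels. The helper name evaluate_holdout and the toy label vectors are hypothetical. In practice the predictions would come from calling predict() on the fitted model with the held-out rows:

```r
# Hypothetical helper: summarise hold-out predictions against the true labels.
evaluate_holdout <- function(predicted, actual) {
  # Use a shared set of levels so the table is square even if a class
  # is never predicted.
  lv <- union(levels(factor(actual)), levels(factor(predicted)))
  predicted <- factor(predicted, levels = lv)
  actual <- factor(actual, levels = lv)
  list(
    accuracy = mean(predicted == actual),
    confusion = table(predicted = predicted, actual = actual)
  )
}

# Toy labels (made up) standing in for real model predictions.
actual <- c("audi", "audi", "bmw", "bmw", "volvo")
predicted <- c("audi", "bmw", "bmw", "bmw", "volvo")

res <- evaluate_holdout(predicted, actual)
res$accuracy
res$confusion
```

Note that the question's code only cleans or imputes rf.train, so rows of rf.test with missing predictors would probably need to be dropped or imputed before predicting. caret's confusionMatrix() reports a similar table together with further statistics.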