different caret/train errors when using oob and k-fold x-val with random forest


Problem description

Here is the code I am using:

# packages used below
library(caret)          # createDataPartition, upSample, trainControl, train
library(randomForest)   # rfImpute and the imports85 data set

# data set for debugging in RStudio
data("imports85")
input <- imports85

# settings
set.seed(1)
dependent <- make.names("make")
training.share <- 0.75
impute <- "yes"
type <- "class" # either "class" or "regr" from SF doc prop


# split off rows w/o label and then split into test/train using stratified sampling
input.labelled <- input[complete.cases(input[,dependent]),]
train.index <- createDataPartition(input.labelled[,dependent], p=training.share, list=FALSE)
rf.train <- input.labelled[train.index,]
rf.test <- input.labelled[-train.index,]

# create cleaned train data set w/ or w/o imputation
if (impute=="no") {
    rf.train.clean <- rf.train[complete.cases(rf.train),] #drop cases w/ missing variables
} else if (impute=="yes") {
    rf.train.clean <- rfImpute(rf.train[,dependent] ~ .,rf.train)[,-1] #impute missing variables and remove added duplicate of dependent column
}

# define variables Y and dependent x
Y <- rf.train.clean[, names(rf.train.clean) == dependent]
x <- rf.train.clean[, names(rf.train.clean) != dependent]

# upsample minority classes (classification only)
if (type=="class") {
    rf.train.upsampled <- upSample(x=x, y=Y)
}

# train and tune RF model
cntrl<-trainControl(method = "oob", number=5, p=0.9, sampling = "up", search='grid') # oob error to tune model
tunegrid <- expand.grid(.mtry = (1:5)) # tune grid with 5 values (mtry = 1:5) for tuning the model
rf <- train(x, Y, method="rf", metric="Accuracy", trControl=cntrl, tuneGrid=tunegrid)
The first error is related to this, but with caret/randomForest instead of lars, and I don't understand it:

Error in order(x[, 1]) : 'x' must be atomic for 'sort.list'. Have you called 'sort' on a list?

No, I did not call 'sort' on a list... at least not that I know of ;-)

I looked at the caret/train documentation, which says that x should be a data frame, and according to str(x) that is the case.

If I use k-fold cross-validation instead of the OOB error, like this

cntrl<-trainControl(method = "repeatedcv", number=5, repeats = 2, p=0.9, sampling = "up", search='grid')

there is another interesting error:

Can't have empty classes in y

Checking complete.cases(Y) seems to show that there are no empty classes...
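(Side note: an empty class here means a factor level of Y with zero rows rather than an NA value, so complete.cases() will not reveal it; a check along these lines is closer to what the error refers to:)

table(Y)                      # observations per factor level of Y
names(which(table(Y) == 0))   # any genuinely empty levels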

Does anyone have a hint for me?

Thanks, Mark

Answer

This is because of your dependent variable. You chose make. Did you check that field? You have training and testing; where do you put the single observation with make = "mercury"? How can you train with it? And if it was not trained on, how can you test it?

library(dplyr)  # for the pipelines below

input %>% 
  group_by(make) %>% 
  summarise(count = n()) %>% 
  arrange(count) %>% 
  print(n = 22)

# # A tibble: 22 × 2
#    make        count
#    <fct>       <int>
#  1 mercury         1
#  2 renault         2
#  3 alfa-romero     3
#  4 chevrolet       3
#  5 jaguar          3
#  6 isuzu           4
#  7 porsche         5
#  8 saab            6
#  9 audi            7
# 10 plymouth        7
# 11 bmw             8
# 12 mercedes-benz   8
# 13 dodge           9
# 14 peugot         11
# 15 volvo          11
# 16 subaru         12
# 17 volkswagen     12
# 18 honda          13
# 19 mitsubishi     13
# 20 mazda          17
# 21 nissan         18
# 22 toyota         32

You also got warnings when running createDataPartition(). I think the randomForest package wants at least five per group. You can filter for the groups to include and use that data for testing and training.

Before the comment labeled settings, you can add the following to subset the groups and verify the result.

filtGrps <- input %>% 
  group_by(make) %>% 
  summarise(count = n()) %>% 
  filter(count >=5) %>% 
  select(make) %>% 
  unlist()

# filter for groups with sufficient observations for package
input <- input %>% 
  filter(make %in% filtGrps) %>% 
  droplevels() # then drop the empty levels

# check to see if it filtered as expected
input %>% 
  group_by(make) %>% 
  summarise(count = n()) %>% 
  arrange(-count) %>% 
  print(n = 16)

This only uses five, which is not ideal. (More is better.)
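If you want a stricter cutoff, it is a one-line change; a small sketch in the same dplyr style (min_obs and input.strict are just illustrative names):

min_obs <- 10   # anything >= 5; higher is better if the data allow it
input.strict <- input %>% 
  group_by(make) %>% 
  filter(n() >= min_obs) %>%   # keep only makes with at least min_obs rows
  ungroup() %>% 
  droplevels()                 # drop the now-empty factor levels
table(input.strict$make)       # verify the remaining class sizes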

However, all of your code works with this filter in place.

rf
# Random Forest 
# 
# 147 samples
#  25 predictor
#  16 classes: 'audi', 'bmw', 'dodge', 'honda', 'mazda', 'mercedes-benz', 'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo' 
# 
# No pre-processing
# Addtional sampling using up-sampling
# 
# Resampling results across tuning parameters:
# 
#   mtry  Accuracy   Kappa    
#   1     0.9505208  0.9472222
#   2     0.9869792  0.9861111
#   3     0.9869792  0.9861111
#   4     0.9895833  0.9888889
#   5     0.9921875  0.9916667
# 
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 5. 
rf$finalModel
# 
# Call:
#  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 5
# 
#         OOB estimate of  error rate: 0.52%
# Confusion matrix:
#               audi bmw dodge honda mazda mercedes-benz mitsubishi nissan peugot
# audi            24   0     0     0     0             0          0      0      0
# bmw              0  24     0     0     0             0          0      0      0
# dodge            0   0    24     0     0             0          0      0      0
# honda            0   0     0    24     0             0          0      0      0
# mazda            0   0     0     0    24             0          0      0      0
# mercedes-benz    0   0     0     0     0            24          0      0      0
# mitsubishi       0   0     0     0     0             0         24      0      0
# nissan           0   0     0     0     0             0          0     24      0
# peugot           0   0     0     0     0             0          0      0     24
# plymouth         0   0     0     0     0             0          0      0      0
# porsche          0   0     0     0     0             0          0      0      0
# saab             0   0     0     0     0             0          0      0      0
# subaru           0   0     0     0     0             0          0      0      0
# toyota           0   0     0     0     0             0          0      1      0
# volkswagen       0   0     0     0     0             0          0      0      0
# volvo            0   0     0     0     0             0          0      0      0
#               plymouth porsche saab subaru toyota volkswagen volvo class.error
# audi                 0       0    0      0      0          0     0  0.00000000
# bmw                  0       0    0      0      0          0     0  0.00000000
# dodge                0       0    0      0      0          0     0  0.00000000
# honda                0       0    0      0      0          0     0  0.00000000
# mazda                0       0    0      0      0          0     0  0.00000000
# mercedes-benz        0       0    0      0      0          0     0  0.00000000
# mitsubishi           0       0    0      0      0          0     0  0.00000000
# nissan               0       0    0      0      0          0     0  0.00000000
# peugot               0       0    0      0      0          0     0  0.00000000
# plymouth            24       0    0      0      0          0     0  0.00000000
# porsche              0      24    0      0      0          0     0  0.00000000
# saab                 0       0   24      0      0          0     0  0.00000000
# subaru               0       0    0     24      0          0     0  0.00000000
# toyota               0       0    0      0     22          0     1  0.08333333
# volkswagen           0       0    0      0      0         24     0  0.00000000
# volvo                0       0    0      0      0          0    24  0.00000000 

Of course, you still need to test this model.
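A minimal sketch of that last step, assuming rf.test gets the same treatment as the training data (the names rf.test.clean and preds are just illustrative):

# drop (or impute) incomplete rows, mirroring what was done for training
rf.test.clean <- rf.test[complete.cases(rf.test), ]
# predict with the tuned model and compare against the held-out labels
preds <- predict(rf, newdata = rf.test.clean[, names(rf.test.clean) != dependent])
confusionMatrix(data = preds,
                reference = factor(rf.test.clean[, dependent], levels = levels(preds)))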
