R - Caret train() "Error: Stopping" with "Not all variable names used in object found in newdata"


Question


I am trying to build a simple Naive Bayes classifier for the mushroom data (archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data). I want to use all of the variables as categorical predictors to predict whether a mushroom is edible.

I am using the caret package.

Here is my code in full:

##################################################################################
# Prepare R and R Studio environment
##################################################################################

# Clear the R studio console
cat("\014")

# Remove objects from environment
rm(list = ls())

# Install and load packages if necessary
if (!require(tidyverse)) {
  install.packages("tidyverse")
  library(tidyverse)
}
if (!require(caret)) {
  install.packages("caret")
  library(caret)
}
if (!require(klaR)) {
  install.packages("klaR")
  library(klaR)
}

#################################

mushrooms <- read.csv("agaricus-lepiota.data", stringsAsFactors = TRUE, header = FALSE)

na.omit(mushrooms)

names(mushrooms) <- c("edibility", "capShape", "capSurface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat")

# convert bruises to a logical variable
mushrooms$bruises <- mushrooms$bruises == 't'

set.seed(1234)
split <- createDataPartition(mushrooms$edibility, p = 0.8, list = FALSE)

train <- mushrooms[split, ]
test <- mushrooms[-split, ]

predictors <- names(train)[2:20] #Create response and predictor data

x <- train[,predictors] #predictors
y <- train$edibility #response

train_control <- trainControl(method = "cv", number = 1) # Set up 1 fold cross validation

edibility_mod1 <- train( #train the model
  x = x,
  y = y,
  method = "nb", 
  trControl = train_control
)

When executing the train() function I get the following output:

Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :2     NA's   :2    
Error: Stopping
In addition: Warning messages:
1: predictions failed for Fold1: usekernel= TRUE, fL=0, adjust=1 Error in predict.NaiveBayes(modelFit, newdata) : 
  Not all variable names used in object found in newdata
 
2: model fit failed for Fold1: usekernel=FALSE, fL=0, adjust=1 Error in x[, 2] : subscript out of bounds
 
3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

x and y after script run:

> str(x)
'data.frame':   6500 obs. of  19 variables:
 $ capShape                : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
 $ capSurface              : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
 $ cap-color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
 $ bruises                 : logi  TRUE TRUE TRUE TRUE FALSE TRUE ...
 $ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
 $ gill-attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
 $ gill-spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
 $ gill-size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
 $ gill-color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
 $ stalk-shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
 $ stalk-root              : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
 $ stalk-surface-above-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
 $ stalk-surface-below-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
 $ stalk-color-above-ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ stalk-color-below-ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ veil-type               : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
 $ veil-color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
 $ ring-number             : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
 $ ring-type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...



> str(y)
 Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...

My environment is:

> R.version
               _                           
platform       x86_64-apple-darwin17.0     
arch           x86_64                      
os             darwin17.0                  
system         x86_64, darwin17.0          
status                                     
major          4                           
minor          0.3                         
year           2020                        
month          10                          
day            10                          
svn rev        79318                       
language       R                           
version.string R version 4.0.3 (2020-10-10)
nickname       Bunny-Wunnies Freak Out     
> RStudio.Version()
$citation

To cite RStudio in publications use:

  RStudio Team (2020). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {RStudio: Integrated Development Environment for R},
    author = {{RStudio Team}},
    organization = {RStudio, PBC},
    address = {Boston, MA},
    year = {2020},
    url = {http://www.rstudio.com/},
  }


$mode
[1] "desktop"

$version
[1] ‘1.3.1093’

$release_name
[1] "Apricot Nasturtium"

Solution

What you are trying to do is a bit tricky: most naive Bayes implementations, or at least the one you are using (from klaR, which is derived from e1071), assume a normal distribution for numeric predictors. You can see this under the Details section of the naiveBayes help page from e1071:

The standard naive Bayes classifier (at least this implementation) assumes independence of the predictor variables, and Gaussian distribution (given the target class) of metric predictors. For attributes with missing values, the corresponding table entries are omitted for prediction.

And your predictors are categorical, so this can be problematic. You can try setting usekernel = TRUE with adjust = 1, which uses a kernel density estimate instead of the Gaussian, and avoid usekernel = FALSE, which will throw the error.

Before that, we remove columns with only one level and clean up the column names. In this case it is also easier to use the formula interface, which avoids having to create dummy variables:

df = train
levels(df[["veil-type"]])            # only one level: "p"
df[["veil-type"]] = NULL             # drop the constant column
colnames(df) = gsub("-", "_", colnames(df))  # make names formula-safe
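The single-level column was spotted here by inspection; a small programmatic check (a sketch, not part of the original answer) can flag any constant factor columns before training:

```r
# Sketch: flag factor columns with fewer than two levels, assuming
# `train` is the data frame built in the question
single_level <- vapply(train,
                       function(col) is.factor(col) && nlevels(col) < 2,
                       logical(1))
names(train)[single_level]  # expected to list "veil-type" for this data
```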

# restrict the tuning grid to usekernel = TRUE to avoid the failing setting
Grid = expand.grid(usekernel = TRUE, adjust = 1, fL = c(0.2, 0.5, 0.8))

mod1 <- train(edibility~.,data=df,
  method = "nb", trControl = trainControl(method="cv",number=5),
  tuneGrid=Grid
)

 mod1
Naive Bayes 

6500 samples
  21 predictor
   2 classes: 'e', 'p' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 5200, 5200, 5200, 5200, 5200 
Resampling results across tuning parameters:

  fL   Accuracy   Kappa    
  0.2  0.9243077  0.8478624
  0.5  0.9243077  0.8478624
  0.8  0.9243077  0.8478624

Tuning parameter 'usekernel' was held constant at a value of TRUE

Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0.2, usekernel = TRUE and
 adjust = 1.
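The original answer stops at the resampling results. As a follow-up sketch (using the `test` split from the question, not code from the original answer), the tuned model can be evaluated on the held-out data, applying the same column clean-up first:

```r
# Sketch: apply the same preprocessing to the test split, then predict
test_df <- test
test_df[["veil-type"]] <- NULL
colnames(test_df) <- gsub("-", "_", colnames(test_df))

preds <- predict(mod1, newdata = test_df)
confusionMatrix(preds, test_df$edibility)  # caret's confusion matrix
```

The key point is that newdata must have exactly the same (renamed) columns the model was trained on; mismatched names are what triggers the "Not all variable names used in object found in newdata" error in the first place.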
