train, validation, test split model in CARET in R


Question

I would like to ask for help please. I use this code to run the XGBoost model in the caret package. However, I want to use a validation split based on time: 60% training, 20% validation, 20% testing. I have already split the data, but I do not know how to handle the validation data when it is not used for cross-validation.

Thanks,

xgb_trainControl = trainControl(
  method = "cv",
  number = 5,
  returnData = FALSE
)

xgb_grid <- expand.grid(nrounds = 1000,
                        eta = 0.01,
                        max_depth = 8,
                        gamma = 1,
                        colsample_bytree = 1,
                        min_child_weight = 1,
                        subsample = 1
)
set.seed(123)
xgb1 = train(sale ~ ., data = trans_train,
  trControl = xgb_trainControl,
  tuneGrid = xgb_grid,
  method = "xgbTree"
)
xgb1
# Note: predict on the fitted xgb1 model (the original snippet
# referenced an unrelated object `lm1` here)
pred = predict(xgb1, trans_test)

Answer

The validation partition should not be used when you are creating the model. It should be 'set aside' until the model has been trained and tuned using the 'training' and 'tuning' partitions; then you can apply the model to predict the outcome of the validation dataset and summarise how accurate the predictions were.

For example, in my own work I create three partitions: training (75%), tuning (10%) and testing/validation (15%), using:

# Define the partition (e.g. 75% of the data for training)
trainIndex <- createDataPartition(data$response, p = .75, 
                                  list = FALSE, 
                                  times = 1)

# Split the dataset using the defined partition
train_data <- data[trainIndex, ,drop=FALSE]
tune_plus_val_data <- data[-trainIndex, ,drop=FALSE]

# Define a new partition to split the remaining 25%
tune_plus_val_index <- createDataPartition(tune_plus_val_data$response,
                                           p = .6,
                                           list = FALSE,
                                           times = 1)

# Split the remaining ~25% of the data: 40% (tune) and 60% (val)
tune_data <- tune_plus_val_data[-tune_plus_val_index, ,drop=FALSE]
val_data <- tune_plus_val_data[tune_plus_val_index, ,drop=FALSE]

# Outcome of this section is that the data (100%) is split into:
# training (~75%)
# tuning (~10%)
# validation (~15%)
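Since the question asks for a split based on time, note that createDataPartition samples rows at random. A chronological alternative (a minimal sketch, assuming the rows of `data` are already sorted in time order) would be:

```r
# Minimal sketch of a chronological 60/20/20 split.
# Assumes the rows of `data` are already ordered by time, so the
# validation and test sets contain only observations that occur
# after the training period (no leakage of future information).
n <- nrow(data)
train_end <- floor(0.60 * n)
val_end   <- floor(0.80 * n)

train_data <- data[1:train_end, , drop = FALSE]
val_data   <- data[(train_end + 1):val_end, , drop = FALSE]
test_data  <- data[(val_end + 1):n, , drop = FALSE]
```

Unlike createDataPartition, this does not stratify on the response, which is the trade-off for respecting time order.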

These data partitions are converted to xgb.DMatrix matrices ("dtrain", "dtune", "dval"). I then use the 'training' partition to train models and the 'tuning' partition to tune hyperparameters (e.g. a random grid search) and to evaluate model training (e.g. cross-validation). This is roughly equivalent to the code in your question.
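The conversion to xgb.DMatrix mentioned above is not shown in the original answer; a sketch could look like this, assuming all predictors are numeric and the outcome column is named `response` (that column name is an assumption, adjust to your own data):

```r
library(xgboost)

# Sketch: build xgb.DMatrix objects from the three partitions.
# Assumes numeric predictors and a 0/1 outcome column `response`.
to_dmatrix <- function(df) {
  features <- as.matrix(df[, setdiff(names(df), "response"), drop = FALSE])
  xgb.DMatrix(data = features, label = df$response)
}

dtrain <- to_dmatrix(train_data)
dtune  <- to_dmatrix(tune_data)
dval   <- to_dmatrix(val_data)
```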

# `lrn` and `mytune` come from an earlier mlr tuning step (not shown);
# mytune$x holds the best hyperparameter values found
lrn_tune <- setHyperPars(lrn, par.vals = mytune$x)
params2 <- list(booster = "gbtree",
               objective = lrn_tune$par.vals$objective,
               eta=lrn_tune$par.vals$eta, gamma=0,
               max_depth=lrn_tune$par.vals$max_depth,
               min_child_weight=lrn_tune$par.vals$min_child_weight,
               subsample = 0.8,
               colsample_bytree=lrn_tune$par.vals$colsample_bytree)

xgb2 <- xgb.train(params = params2,
                   data = dtrain, nrounds = 50,
                   watchlist = list(val=dtune, train=dtrain),
                   print_every_n = 10, early_stopping_rounds = 50,
                   maximize = FALSE, eval_metric = "error")
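If you would rather stay within caret, as in the question, you can make train() evaluate hyperparameters on a single predefined tuning partition instead of cross-validation via trainControl's index/indexOut arguments. A hedged sketch, assuming `train_data` and `tune_data` from the partitioning step above (the stacked object name is an assumption):

```r
library(caret)

# Stack training and tuning rows into one data frame; caret fits on
# the `index` rows and evaluates on the `indexOut` rows.
combined_data <- rbind(train_data, tune_data)
n_train <- nrow(train_data)

fixed_split_control <- trainControl(
  index = list(TrainSet = seq_len(n_train)),
  indexOut = list(TuneSet = (n_train + 1):nrow(combined_data)),
  returnData = FALSE
)

# Then, e.g.:
# train(sale ~ ., data = combined_data,
#       trControl = fixed_split_control,
#       tuneGrid = xgb_grid, method = "xgbTree")
```

When `index` is supplied explicitly, caret uses exactly that single resample, so each candidate in tuneGrid is scored once on the tuning rows.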

Once the model is trained, I apply it to the validation data with predict():

xgbpred2_keep <- predict(xgb2, dval)

xg2_val <- data.frame("Prediction" = xgbpred2_keep,
                      "Patient" = rownames(val_data),
                      "Response" = val_data$response)

# Reorder Patients according to Response
xg2_val$Patient <- factor(xg2_val$Patient,
                          levels = xg2_val$Patient[order(xg2_val$Response)])

ggplot(xg2_val, aes(x = Patient, y = Prediction,
                    fill = Response)) +
  geom_bar(stat = "identity") +
  theme_bw(base_size = 16) +
  labs(title=paste("Patient predictions (xgb2) for the validation dataset (n = ",
                   nrow(val_data), ")", sep = ""), 
       subtitle="Above 0.5 = Non-Responder, Below 0.5 = Responder", 
       caption=paste("JM", Sys.Date(), sep = " "),
       x = "") +
  theme(axis.text.x = element_text(angle=90, vjust=0.5,
                                   hjust = 1, size = 8)) +
# Distance from red line = confidence of prediction
  geom_hline(yintercept = 0.5, colour = "red")


# Convert predictions to binary outcome (responder / non-responder)
xgbpred2_binary <- ifelse(predict(xgb2, dval) > 0.5,1,0)

# Results matrix (i.e. true positives/negatives & false positives/negatives);
# `labels_tv` holds the true labels of the validation set
confusionMatrix(as.factor(xgbpred2_binary), as.factor(labels_tv))


# Summary of results
Summary_of_results <- data.frame(Patient_ID = rownames(val_data),
                                 label = labels_tv, 
                                 pred = xgbpred2_binary)
Summary_of_results$eval <- ifelse(
  Summary_of_results$label != Summary_of_results$pred,
  "wrong",
  "correct")
Summary_of_results$conf <- round(predict(xgb2, dval), 2)
Summary_of_results$CDS <- val_data$`variants`
Summary_of_results

This gives you a summary of how well the model 'works' on your validation data.

