R caret: estimate parameters on a subset, fit to full data


Question

I have a dataset of 550k items, split into 500k for training and 50k for testing. During the training stage it is necessary to establish the 'best' combination of each algorithm's parameter values. Rather than use the entire 500k for this I'd be happy to use a subset, but when it comes to training the final model with the 'best' combination, I'd like to use the full 500k. In pseudocode the task looks like:

subset the 500k training data to 50k
for each combination of model parameters (3, 6, or 9)
  for each repeat (3)
    for each fold (10)
       fit the model on 50k training data using the 9 folds
       evaluate performance on the remaining fold
establish the best combination of parameters
fit to all 500k using best combination of parameters

To do this I need to tell caret that it should subset the data before optimisation, but use all the data for the final fit.

I can do this by: (1) subsetting the data; (2) running the usual train stages; (3) stopping the final fit (not needed); (4) establishing the 'best' combination (this is in the output of train); (5) running train on the full 500k with no parameter optimisation.

This is a bit untidy, and I don't know how to stop caret from training the final model on the subset, which I will never use.
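
For reference, the manual steps (1), (2), (4) and (5) can be sketched as below. This is only an illustration under stated assumptions: iris stands in for the 500k training set, a 60-row sample stands in for the 50k subset, and k-nearest neighbours with k = 3, 6 or 9 stands in for the real model and grid. It does not address step (3): the first train call still fits its own final model on the subset.

```r
library(caret)

set.seed(1)                                   # assumed seed, for repeatability only
full_train  <- iris                           # stand-in for the 500k training rows
subset_rows <- sample(nrow(full_train), 60)   # stand-in for the 50k tuning subset

# Steps (1), (2), (4): tune on the subset with 3 x 10-fold cross-validation.
tune_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
tuned <- train(
  Species ~ .,
  data = full_train[subset_rows, ],
  method = "knn",
  tuneGrid = data.frame(k = c(3, 6, 9)),
  trControl = tune_ctrl
)

# Step (5): refit on every row with the winning parameters and no resampling.
final <- train(
  Species ~ .,
  data = full_train,
  method = "knn",
  tuneGrid = tuned$bestTune,                  # a single-row grid
  trControl = trainControl(method = "none")
)
```

With method = "none" caret skips resampling entirely, so tuneGrid has to contain exactly one parameter combination, which is what bestTune provides.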

Answer

This is possible by specifying the index, indexOut and indexFinal arguments to trainControl.

Here is an example using the Sonar data set from the mlbench library:

library(caret)
library(mlbench)
data(Sonar)

Let's say we want to draw half of the Sonar data set each time for training, and repeat that 10 times:

train_inds <- replicate(10, sample(1:nrow(Sonar), size = nrow(Sonar)/2), simplify = FALSE)

If you are interested in a different sampling approach, please post the details. This is for illustration only.

For testing we will use 10 random rows that are not in train_inds:

test_inds <- lapply(train_inds, function(x){
  inds <- setdiff(1:nrow(Sonar), x)
  return(sample(inds, size = 10))
}
)
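
As a quick sanity check, one could confirm that no resample is evaluated on rows it was trained on. The sketch below uses base R only and rebuilds both index lists the same way; the seed and the hard-coded row count (208, the size of Sonar) are assumptions made so that it runs on its own.

```r
set.seed(42)   # assumed seed, only to make the draw repeatable
n <- 208       # number of rows in Sonar

# Same construction as above, written against n instead of nrow(Sonar).
train_inds <- replicate(10, sample(seq_len(n), size = n / 2), simplify = FALSE)
test_inds <- lapply(train_inds, function(x) {
  sample(setdiff(seq_len(n), x), size = 10)
})

# Count rows shared between each training set and its matching test set.
overlap <- mapply(
  function(tr, te) length(intersect(tr, te)),
  train_inds, test_inds
)
stopifnot(all(overlap == 0))
```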

Now just specify test_inds and train_inds in trainControl:

ctrl <-  trainControl(
    method = "boot",
    number = 10,
    classProbs = TRUE,
    savePredictions = "final",
    index = train_inds,
    indexOut = test_inds,
    indexFinal = 1:nrow(Sonar),
    summaryFunction = twoClassSummary
  )

Here indexFinal is set to every row, which is also what caret uses when it is left unspecified. Pass a smaller set of row numbers instead if you do not wish to fit the final model on all rows.

Fit:

model <- train(
    Class ~ .,
    data = Sonar,
    method = "rf",
    trControl = ctrl,
    metric = "ROC"
  )
model
#output
Random Forest 

208 samples, 208 used for final model
 60 predictor
  2 classes: 'M', 'R' 

No pre-processing
Resampling: Bootstrapped (10 reps) 
Summary of sample sizes: 104, 104, 104, 104, 104, 104, ... 
Resampling results across tuning parameters:

  mtry  ROC        Sens    Spec     
   2    0.9104167  0.7750  0.8250000
  31    0.9125000  0.7875  0.7916667
  60    0.9083333  0.7875  0.8166667
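
To bring this closer to the setup in the question (3 repeats of 10-fold cross-validation on a subset, final fit on all rows), the same three arguments could be filled as sketched below. The stand-ins are assumptions: iris replaces the 500k rows, a 60-row sample replaces the 50k subset, and knn with k = 3, 6 or 9 replaces the real model and grid.

```r
library(caret)

set.seed(1)                                   # assumed seed, for repeatability only
full_train  <- iris                           # stand-in for the 500k training rows
subset_rows <- sample(nrow(full_train), 60)   # stand-in for the 50k tuning subset

# createMultiFolds returns training positions within the vector it is given,
# so translate them back into row numbers of full_train.
folds <- createMultiFolds(full_train$Species[subset_rows], k = 10, times = 3)
train_inds <- lapply(folds, function(pos) subset_rows[pos])

# Hold out the remainder of the subset, not the remainder of the full data.
test_inds <- lapply(train_inds, function(x) setdiff(subset_rows, x))

ctrl <- trainControl(
  method = "repeatedcv", number = 10, repeats = 3,
  index = train_inds,                         # rows used to fit each resample
  indexOut = test_inds,                       # rows used to score each resample
  indexFinal = seq_len(nrow(full_train))      # rows used for the final model
)

model <- train(
  Species ~ .,
  data = full_train,
  method = "knn",
  tuneGrid = data.frame(k = c(3, 6, 9)),
  trControl = ctrl
)
```

Because every resampling index points into subset_rows, the tuning only ever touches the subset, while the single final fit uses all rows, so no unused model is trained along the way.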
