在R中整合不同的数据集 [英] Ensemble different datasets in R

查看:269
本文介绍了在R中整合不同的数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用所描述的示例来组合来自不同模型的信号这里.我有不同的数据集,它们预测相同的输出.但是,当我在caretList中组合模型输出并整合信号时,会给出错误

I am trying to combine signals from different models using the example described here . I have different datasets which predicts the same output. However, when I combine the model output in caretList, and ensemble the signals, it gives an error

Error in check_bestpreds_resamples(modelLibrary) : 
  Component models do not have the same re-sampling strategies

这是可复制的示例

library(caret)
library(caretEnsemble)
df1 <-
  data.frame(x1 = rnorm(200),
             x2 = rnorm(200),
             y = as.factor(sample(c("Jack", "Jill"), 200, replace = T)))

df2 <-
  data.frame(z1 = rnorm(400),
             z2 = rnorm(400),
             y = as.factor(sample(c("Jack", "Jill"), 400, replace = T)))

library(caret)
check_1 <- train( x = df1[,1:2],y = df1[,3],
                 method = "nnet",
                 tuneLength = 10,
                 trControl = trainControl(method = "cv",
                                          classProbs = TRUE,
                                          savePredictions = T))

check_2 <- train( x = df2[,1:2],y = df2[,3] ,
                 method = "nnet",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trainControl(method = "cv",
                                          classProbs = TRUE,
                                          savePredictions = T))


combine <- c(check_1, check_2)
ens <- caretEnsemble(combine)

推荐答案

首先,您要尝试结合在不同训练数据集上训练的2个模型.那是行不通的.所有集成模型都需要基于相同的训练集.在每个训练有素的模型中,您将有不同的重采样集.因此,您当前的错误.

First of all, you are trying to combine 2 models trained on different training data sets. That is not going to work. All ensemble models will need to be based on the same training set. You will have different sets of resamples in each trained model. Hence your current error.

在不使用caretList的情况下构建模型也是很危险的,因为获得不同的重采样策略将有很大的变化.您可以通过使用trainControl中的索引来更好地控制它(请参阅

Also building your models without using caretList is dangerous because you will have a big change of getting different resample strategies. You can control that a bit better by using the index in trainControl (see vignette).

如果使用1个数据集,则可以使用以下代码:

If you use 1 dataset you can use the following code:

ctrl <- trainControl(method = "cv",
                     number = 5,
                     classProbs = TRUE,
                     savePredictions = "final")

set.seed(1324)
# will generate the following warning:
# indexes not defined in trControl.  Attempting to set them ourselves, so 
# each model in the ensemble will have the same resampling indexes.
models <- caretList(x = df1[,1:2],
                    y = df1[,3] ,
                    trControl = ctrl,
                    tuneList = list(
                      check_1 = caretModelSpec(method = "nnet", tuneLength = 10),
                      check_2 = caretModelSpec(method = "nnet", tuneLength = 10, preProcess = c("center", "scale"))
                    )) 


ens <- caretEnsemble(models)


A glm ensemble of 2 base models: nnet, nnet

Ensemble results:
Generalized Linear Model 

200 samples
  2 predictor
  2 classes: 'Jack', 'Jill' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results:

  Accuracy   Kappa     
  0.5249231  0.04164767

还请阅读本指南,以了解不同的集成策略.

Also read this guide on different ensemble strategies.

这篇关于在R中整合不同的数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆