Understanding scikit-learn GridSearchCV - param tuning and averaging performance metrics


Question


I am trying to understand how exactly GridSearchCV in scikit-learn implements the train-validation-test principle in machine learning. As you can see in the following code, my understanding of what it does is as follows:

  1. split the 'dataset' into 75% and 25%, where the 75% is used for param tuning and the 25% is the held-out test set (line 1)
  2. initialize some parameters to search over (lines 3 to 6)
  3. fit the model on the 75% portion of the dataset, but split that portion into 5 folds, i.e., each time train on 60% of the full data and test on another 15%, and do this 5 times (lines 8-10). I have my first and second questions here, see below.
  4. take the best-performing model and parameters and test on the holdout data (lines 11-13)

Question 1: what exactly is going on in step 3 with respect to the parameter space? Is GridSearchCV trying every parameter combination on every one of the five folds (5-fold CV), giving a total of 10 runs? (i.e., the single value from 'optimizers', 'init', and 'batches' is paired with each of the 2 values from 'epochs')

Question 2: what score does the 'cross_val_score' line (line 10) compute? Is this the average of the scores on the single held-out fold of the data in each of the 5 runs (i.e., the average over five held-out folds, each 15% of the entire dataset)?

Question 3: suppose line 5 now also has only 1 parameter value; in that case GridSearchCV is really not searching any parameters, because each parameter has only 1 value. Is this correct?

Question 4: in the case explained in question 3, if we take a weighted average of the scores computed on the 5 folds of the GridSearchCV run and on the held-out run, that gives us an average performance score over the entire dataset. This is very similar to a 6-fold cross-validation experiment (i.e., without grid search), except that the 6 folds are not all of equal size. Or is it not?

Many thanks in advance for any replies!

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from keras.wrappers.scikit_learn import KerasClassifier  # Keras' scikit-learn wrapper

# 'dataset' (a NumPy array) and 'create_model' (a function returning a
# compiled Keras model) are assumed to be defined elsewhere.
X_train_data, X_test_data, y_train, y_test = \
         train_test_split(dataset[:, 0:8], dataset[:, 8],
                          test_size=0.25,
                          random_state=42)  # line 1

model = KerasClassifier(build_fn=create_model, verbose=0)
optimizers = ['adam']  # line 3
init = ['uniform']
epochs = [10, 20]  # line 5
batches = [5]   # line 6
param_grid = dict(optimizer=optimizers, epochs=epochs, batch_size=batches, init=init)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)  # line 8
grid_result = grid.fit(X_train_data, y_train)
cross_val_score(grid.best_estimator_, X_train_data, y_train, cv=5).mean()  # line 10
best_param_ann = grid.best_params_      # line 11
best_estimator = grid.best_estimator_
heldout_predictions = best_estimator.predict(X_test_data)   # line 13

Solution

Question 1: As you said, your dataset will be split into 5 folds. Every parameter combination will be tried (in your case, 2 combinations). For each combination, the model is trained on 4 of the 5 folds; the remaining fold is used as the test fold. So you are right: in your example, you are going to train a model 10 times.
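To make the count concrete, here is a minimal sketch using sklearn's ParameterGrid, with the param_grid dict taken from the question's code:

from sklearn.model_selection import ParameterGrid

param_grid = dict(optimizer=['adam'], epochs=[10, 20], batch_size=[5], init=['uniform'])
combos = list(ParameterGrid(param_grid))
print(len(combos))      # 2 combinations: epochs=10 and epochs=20
print(len(combos) * 5)  # 10 model fits with cv=5 (one per combination per fold)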

Question 2: 'cross_val_score' returns one score (accuracy by default for a classifier) per test fold, and the .mean() at the end averages these 5 fold scores. Averaging is done to avoid, for example, getting a good result just because one particular test fold happened to be really easy.
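Dropping the .mean() makes this visible (a sketch reusing the names from the question's code):

scores = cross_val_score(grid.best_estimator_, X_train_data, y_train, cv=5)
print(scores)         # array of 5 scores, one per held-out fold (each ~15% of the full data)
print(scores.mean())  # the single number computed on line 10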

Question 3: Yes. It makes no sense to do a grid search if you have only one set of parameters to try.

Question 4: I didn't exactly understand your question. Usually, you use a grid search on your train set only. This allows you to keep your test set aside as a validation set. Without cross-validation, you could find a perfect setting that maximises results on your test set, and you would be overfitting the test set. With cross-validation, you can play as much as you want with parameter fine-tuning, because you don't use your validation set to choose the settings.
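For reference, the weighted average described in question 4 would be computed like this (purely illustrative; the fold_scores and holdout_score values below are made-up placeholders):

import numpy as np

fold_scores = np.array([0.70, 0.72, 0.68, 0.71, 0.69])  # 5 CV folds, each 15% of the full data
holdout_score = 0.66                                     # held-out test set, 25% of the full data

weights = np.array([0.15] * 5 + [0.25])                  # weights sum to 1.0
weighted_avg = np.average(np.append(fold_scores, holdout_score), weights=weights)
print(weighted_avg)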

In your code, there is no big need for CV since you don't have a lot of parameters to play with, but if you start adding regularization, you may try 10+ combinations, and in such cases CV is required.
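For instance, a hypothetical expanded grid (not from the question's code) quickly reaches that scale:

param_grid = dict(optimizer=['adam', 'rmsprop'],
                  epochs=[10, 20],
                  batch_size=[5, 10, 20],
                  init=['uniform', 'normal'])
# 2 * 2 * 3 * 2 = 24 combinations -> 24 * 5 = 120 fits with cv=5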

I hope this helps.
