Understanding scikit-learn GridSearchCV - param tuning and averaging performance metrics


Question


I am trying to understand how exactly GridSearchCV in scikit-learn implements the train-validation-test principle in machine learning. As you can see in the following code, my understanding of what it does is as follows:

  1. split the 'dataset' into 75% and 25%, where the 75% is used for param tuning and the 25% is the held-out test set (line 1)
  2. initialize some parameters to search over (lines 3 to 6)
  3. fit the model on the 75% portion of the dataset, but split that portion into 5 folds, i.e., each time train on 60% of the full data and test on another 15%, and do this 5 times (lines 8-10). I have my first and second questions here, see below.
  4. take the best-performing model and parameters and test on the holdout data (lines 11-13)

Question 1: what exactly is going on in step 3 with respect to the parameter space? Is GridSearchCV trying every parameter combination on every one of the five folds (5-fold CV), giving a total of 10 runs? (i.e., the single value from 'optimizers', 'init', and 'batches' is paired with each of the 2 values from 'epochs')

Question 2: what score does the 'cross_val_score' line (line 10) compute? Is this the average of the scores on the single held-out fold of the data in each of the 5 runs (i.e., the average over five held-out folds, each 15% of the entire dataset)?

Question 3: suppose line 5 now also has only 1 parameter value; in that case GridSearchCV is really not searching any parameters, because each parameter has only 1 value. Is this correct?

Question 4: in the case explained in question 3, if we take a weighted average of the scores computed on the 5 folds of the GridSearchCV run and on the held-out run, that gives us an average performance score over the entire dataset. This is very similar to a 6-fold cross-validation experiment (i.e., without grid search), except that the 6 folds are not all of equal size. Or is it not?

Many thanks in advance for any replies!

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from keras.wrappers.scikit_learn import KerasClassifier  # Keras' scikit-learn wrapper

# 'dataset' (a NumPy array) and 'create_model' (a function returning a
# compiled Keras model) are assumed to be defined elsewhere.
X_train_data, X_test_data, y_train, y_test = \
         train_test_split(dataset[:, 0:8], dataset[:, 8],
                          test_size=0.25,
                          random_state=42)  # line 1

model = KerasClassifier(build_fn=create_model, verbose=0)
optimizers = ['adam']  # line 3
init = ['uniform']
epochs = [10, 20]  # line 5
batches = [5]   # line 6
param_grid = dict(optimizer=optimizers, epochs=epochs, batch_size=batches, init=init)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)  # line 8
grid_result = grid.fit(X_train_data, y_train)
cross_val_score(grid.best_estimator_, X_train_data, y_train, cv=5).mean()  # line 10
best_param_ann = grid.best_params_      # line 11
best_estimator = grid.best_estimator_
heldout_predictions = best_estimator.predict(X_test_data)   # line 13

Solution

Question 1: As you said, your dataset will be split into 5 folds. Every parameter combination will be tried (in your case, 2 combinations). For each combination, the model is trained on 4 of the 5 folds; the remaining fold is used as the test fold. So you are right: in your example, you are going to train a model 10 times.
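To make the count concrete, here is a minimal sketch using sklearn's ParameterGrid, with the param_grid dict taken from the question's code:

from sklearn.model_selection import ParameterGrid

param_grid = dict(optimizer=['adam'], epochs=[10, 20], batch_size=[5], init=['uniform'])
combos = list(ParameterGrid(param_grid))
print(len(combos))      # 2 combinations: epochs=10 and epochs=20
print(len(combos) * 5)  # 10 model fits with cv=5 (one per combination per fold)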

Question 2: 'cross_val_score' returns one score (accuracy by default for a classifier) per test fold, and the .mean() at the end averages these 5 fold scores. Averaging is done to avoid, for example, getting a good result just because one particular test fold happened to be really easy.
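Dropping the .mean() makes this visible (a sketch reusing the names from the question's code):

scores = cross_val_score(grid.best_estimator_, X_train_data, y_train, cv=5)
print(scores)         # array of 5 scores, one per held-out fold (each ~15% of the full data)
print(scores.mean())  # the single number computed on line 10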

Question 3: Yes. It makes no sense to do a grid search if you have only one set of parameters to try.

Question 4: I didn't exactly understand your question. Usually, you use a grid search on your train set only. This allows you to keep your test set aside as a validation set. Without cross-validation, you could find a perfect setting that maximises results on your test set, and you would be overfitting the test set. With cross-validation, you can play as much as you want with parameter fine-tuning, because you don't use your validation set to choose the settings.
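For reference, the weighted average described in question 4 would be computed like this (purely illustrative; the fold_scores and holdout_score values below are made-up placeholders):

import numpy as np

fold_scores = np.array([0.70, 0.72, 0.68, 0.71, 0.69])  # 5 CV folds, each 15% of the full data
holdout_score = 0.66                                     # held-out test set, 25% of the full data

weights = np.array([0.15] * 5 + [0.25])                  # weights sum to 1.0
weighted_avg = np.average(np.append(fold_scores, holdout_score), weights=weights)
print(weighted_avg)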

In your code, there is no big need for CV since you don't have a lot of parameters to play with, but if you start adding regularization, you may try 10+ combinations, and in such cases CV is required.
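For instance, a hypothetical expanded grid (not from the question's code) quickly reaches that scale:

param_grid = dict(optimizer=['adam', 'rmsprop'],
                  epochs=[10, 20],
                  batch_size=[5, 10, 20],
                  init=['uniform', 'normal'])
# 2 * 2 * 3 * 2 = 24 combinations -> 24 * 5 = 120 fits with cv=5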

I hope this helps.
