Model help using Scikit-learn when using GridSearch


Question

As part of the Enron project, I built the model below. The steps are summarized after the code.

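from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# pipe, clf_params, features and labels are defined earlier (not shown)
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)

gcv.fit(features, labels)  # fit with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    # predict with the best estimator, without refitting it
    pred = gcv.best_estimator_.predict(x_test)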

The model below gives more reasonable, but lower, scores:

# same setup as in the first snippet
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)

gcv.fit(features, labels)  # fit with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    # refit the best estimator on the training split only, then predict
    gcv.best_estimator_.fit(x_train, y_train)
    pred = gcv.best_estimator_.predict(x_test)

  1. Used SelectKBest to score the features, sorted them by score, and tried combinations of higher- and lower-scoring features.

  2. Used an SVM with a GridSearch, with a StratifiedShuffleSplit as the cross-validator.

  3. Used best_estimator_ to predict and to calculate precision and recall.
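For readers without the original project, the setup described in these steps might look roughly like the sketch below. It is an illustration only: the synthetic data, the step names `kbest` and `svm`, the choice of `f_classif` as the scoring function, and the grid values are all assumptions, not the asker's actual `pipe` or `clf_params`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the Enron features and labels (assumption).
features, labels = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Feature selection followed by an SVM; the step names are hypothetical.
pipe = Pipeline([
    ("kbest", SelectKBest(score_func=f_classif)),
    ("svm", SVC()),
])

# Hypothetical parameter grid: number of kept features and SVM penalty.
clf_params = {
    "kbest__k": [3, 5],
    "svm__C": [0.1, 1, 10],
}

# Fewer splits than the question's 100, only to keep the sketch fast.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)

print("best parameters:", gcv.best_params_)
print("mean cross-validated score:", gcv.best_score_)
```

Putting SelectKBest inside the pipeline means the feature scores are recomputed on each training split during the search, rather than once on the full dataset.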

The problem is that the estimator spits out perfect scores, in some cases exactly 1.

But when I refit the best classifier on the training data and then run it on the test data, it gives reasonable scores.

My question is what exactly GridSearch does with the test data after splitting with the shuffle-split object we pass in. I assumed it would not fit anything on the test data; if that were true, predicting on the same test data should not give such high scores, right? Since I set random_state, the shuffle split should produce the same splits for the grid search fit and for the prediction loop.

So, is it wrong to use the same StratifiedShuffleSplit for both?
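The assumption about random_state can be checked directly. The sketch below uses made-up data (the array shapes and class counts are arbitrary) and compares the test indices from two separate calls to `split` on the same splitter.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Arbitrary toy data (assumption): 50 samples, two imbalanced classes.
rng = np.random.RandomState(0)
features = rng.normal(size=(50, 3))
labels = np.array([0] * 35 + [1] * 15)

cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# Collect the test indices from two independent passes over the splitter.
first = [test_ind.tolist() for _, test_ind in cv.split(features, labels)]
second = [test_ind.tolist() for _, test_ind in cv.split(features, labels)]

print("same test indices on both passes:", first == second)
```

With an integer random_state the two passes should match, so reusing the splitter is not itself the issue. It does mean the prediction loop revisits exactly the test splits that the grid search already used.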

Recommended answer

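Basically, the grid search will:

  • Try every combination in your parameter grid
  • For each combination, run a K-fold cross-validation (or use the cv splitter you pass in)
  • Select the best one.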
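The three steps above can be written out by hand. This is only a rough sketch of the idea on synthetic data with a hypothetical grid, not a description of how GridSearchCV is implemented internally; it then compares the hand-rolled result with GridSearchCV on the same splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    GridSearchCV,
    ParameterGrid,
    StratifiedShuffleSplit,
    cross_val_score,
)
from sklearn.svm import SVC

# Synthetic data and a hypothetical grid (assumptions).
X, y = make_classification(n_samples=120, n_features=6, random_state=0)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

best_score, best_params = -np.inf, None
for params in ParameterGrid(param_grid):            # every combination
    model = SVC(**params)
    score = cross_val_score(model, X, y, cv=cv).mean()  # cross-validate it
    if score > best_score:                           # keep the best one
        best_score, best_params = score, params

# For comparison: the same search done by GridSearchCV.
gcv = GridSearchCV(SVC(), param_grid, cv=cv).fit(X, y)

print("manual:      ", best_params, best_score)
print("GridSearchCV:", gcv.best_params_, gcv.best_score_)
```

In each iteration the model is fit on the training part of a split and scored on the held-out part, so the search itself never scores a model on data it was fit on.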

So your second case is the good one. In the first case you are predicting on data the model was trained with: by default (refit=True), GridSearchCV refits best_estimator_ on the whole dataset passed to fit, which includes every test split. That is not what happens in the second case, where you keep only the best parameters from the grid search and fit on the training split.
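The difference between the two cases can be reproduced on synthetic data. The sketch below is a hedged illustration: the noisy dataset and the grid are assumptions chosen so that overfitting is visible, and it scores with the estimator's default accuracy rather than the precision and recall used in the question. It uses `clone` for the second case so that `best_estimator_` itself is not overwritten inside the loop.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.svm import SVC

# Noisy synthetic data (assumption) so that overfitting shows up.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)

cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}  # hypothetical grid
gcv = GridSearchCV(SVC(), param_grid, cv=cv).fit(X, y)

leaky, honest = [], []
for train_ind, test_ind in cv.split(X, y):
    # Case 1: best_estimator_ was refit on all of X, so X[test_ind]
    # is part of its training data.
    leaky.append(gcv.best_estimator_.score(X[test_ind], y[test_ind]))

    # Case 2: a fresh copy with the best parameters, fit on the
    # training split only.
    model = clone(gcv.best_estimator_).fit(X[train_ind], y[train_ind])
    honest.append(model.score(X[test_ind], y[test_ind]))

leaky_mean, honest_mean = np.mean(leaky), np.mean(honest)
print("case 1 (refit on full data):  ", leaky_mean)
print("case 2 (fit on training split):", honest_mean)
print("gcv.best_score_:               ", gcv.best_score_)
```

Because the second loop repeats the same fits on the same splits, its mean should coincide with `gcv.best_score_`. That number is still somewhat optimistic, since the parameters were chosen on those splits; a separate held-out test set would give a cleaner estimate.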
