Model help using Scikit-learn when using GridSearch


Question

As part of the Enron project, I built the attached model. Below is a summary of the steps:

from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)

gcv.fit(features, labels)  # fit with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    gcv.best_estimator_.predict(x_test)

The following model gives more reasonable, but lower, scores:

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)

gcv.fit(features, labels)  # fit with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    gcv.best_estimator_.fit(x_train, y_train)  # refit on the training split only
    gcv.best_estimator_.predict(x_test)

  1. Used KBest to find the feature scores, sorted the features, and tried combinations of the higher and lower scores.

  2. Used SVM with a GridSearch using a StratifiedShuffle split (a sketch of how pipe and clf_params might be set up follows this list).

  3. Used the best_estimator_ to predict and calculate the precision and recall.
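The question does not show how pipe and clf_params were defined; a minimal sketch of what they might look like for the steps above, assuming a SelectKBest + SVC pipeline (the step names, k values and SVC parameters here are purely illustrative, not from the original post):

# Hypothetical reconstruction of pipe and clf_params (illustrative only)
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

pipe = Pipeline([
    ('select', SelectKBest(score_func=f_classif)),
    ('svc', SVC()),
])

# Parameter grid; keys follow the "<step>__<param>" convention used by Pipeline
clf_params = {
    'select__k': [5, 10, 15],
    'svc__C': [1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1],
}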

The problem is that the estimator is spitting out perfect scores, in some cases 1.

But when I refit the best classifier on the training data and then run the test, it gives reasonable scores.

My doubt/question is: what exactly does GridSearch do with the test data after the split, using the ShuffleSplit object we pass in to it? I assumed it would not fit anything on the test data; if that were true, then when I predict using that same test data, it should not give such high scores, right? Since I used a random_state value, the ShuffleSplit should have created the same splits for the grid fit and for the predict.
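For reference, GridSearchCV's refit behaviour matters here: with refit=True (the default), it refits best_estimator_ on all of the data passed to fit(), so predicting on indices drawn from that same data evaluates the model on points it was trained on. A minimal, self-contained sketch with synthetic data (not the Enron features) that shows this:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.svm import SVC

features, labels = make_classification(n_samples=200, random_state=0)

cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
gcv = GridSearchCV(SVC(), {'C': [1, 10]}, cv=cv)
gcv.fit(features, labels)  # best_estimator_ is refit here on ALL 200 samples

train_ind, test_ind = next(cv.split(features, labels))
pred = gcv.best_estimator_.predict(features[test_ind])

# Optimistic: every point in test_ind was already seen during the final refit
print(accuracy_score(labels[test_ind], pred))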

So, is using the same ShuffleSplit for both wrong?

Answer

Basically the grid search will (see the sketch after this list):

  • Try every combination of your parameter grid
  • For each of them, do a K-fold cross-validation
  • Select the best one.
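In other words, a rough, simplified equivalent of that loop (a sketch only, assuming a plain SVC and a small illustrative grid; the real GridSearchCV also handles scorers, parallelism, and refitting the winner on the full data):

from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

features, labels = make_classification(n_samples=200, random_state=0)
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

param_grid = {'C': [1, 10], 'gamma': [0.01, 0.1]}
best_score, best_params = -np.inf, None

for values in product(*param_grid.values()):  # try every combination of the grid
    params = dict(zip(param_grid.keys(), values))
    scores = []
    for train_ind, test_ind in cv.split(features, labels):  # cross-validate this combination
        est = SVC(**params)
        est.fit(features[train_ind], labels[train_ind])
        scores.append(est.score(features[test_ind], labels[test_ind]))
    if np.mean(scores) > best_score:  # keep the best combination
        best_score, best_params = np.mean(scores), params

print(best_params, best_score)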

So your second case is the good one. Otherwise you are actually predicting data that you trained with (which is not the case in the second option; there you only keep the best parameters from your grid search).
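Concretely, the second pattern with the precision/recall calculation made explicit might look like this (a sketch assuming binary labels and NumPy arrays; pipe, clf_params, features and labels come from the question and are not defined here):

import numpy as np
from sklearn.metrics import precision_score, recall_score

precisions, recalls = [], []
for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    clf = gcv.best_estimator_   # the hyper-parameters chosen by the grid search
    clf.fit(x_train, y_train)   # refit on the training split only
    pred = clf.predict(x_test)  # evaluate on data unseen during this fit

    precisions.append(precision_score(y_test, pred))
    recalls.append(recall_score(y_test, pred))

print(np.mean(precisions), np.mean(recalls))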
