Why is sklearn.feature_selection.RFECV giving different results for each run?


Question

I tried to do feature selection with RFECV, but it gives different results each time. Does cross-validation divide the sample X into random chunks or into sequential, deterministic chunks?

Also, why is the score different for grid_scores_ and score(X, y)? Why are the scores sometimes negative?

Recommended answer

Does cross-validation divide the sample X into random chunks or into sequential, deterministic chunks?

By default, CV divides the data into deterministic, consecutive chunks. You can change this behaviour by setting the splitter's shuffle parameter to True.

However, RFECV uses sklearn.model_selection.StratifiedKFold if y is binary or multiclass.

This means that it splits the data so that each fold has the same (or nearly the same) ratio of classes. To achieve this, the samples that land in each fold differ from what a plain sequential split would give. With the default shuffle=False, however, the stratified folds are still the same on every run, so the splitter by itself should not cause major changes in the results.

If you are passing a CV iterator through the cv parameter, you can fix the splits by specifying a random state (for KFold and StratifiedKFold, random_state only takes effect when shuffle=True). The random state seeds the random decisions made by the algorithm, so using the same random state each time ensures the same behaviour. The same reasoning applies to the estimator itself if it makes random decisions of its own.
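The following is a sketch of a reproducible setup, not the asker's actual code: the question does not say which estimator or data were used, so the RandomForestClassifier, the synthetic dataset, and the helper run_rfecv are all assumptions chosen for illustration. It fixes the random state of both the CV iterator and the estimator and checks that two runs select the same features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic data, assumed for illustration only
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

def run_rfecv():
    # Hypothetical helper: both sources of randomness are seeded
    estimator = RandomForestClassifier(n_estimators=20, random_state=0)
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    selector = RFECV(estimator, step=1, cv=cv)
    selector.fit(X, y)
    return selector.support_  # boolean mask of the selected features

first = run_rfecv()
second = run_rfecv()
print(np.array_equal(first, second))  # True: same seeds, same selection
```

Removing random_state from either the estimator or the shuffled splitter is an easy way to check which of the two is responsible for run-to-run differences.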

Also, why is the score different for grid_scores_ and score(X, y)?

grid_scores_ is an array of cross-validation scores, one per candidate feature subset: grid_scores_[i] is the cross-validation score for the i-th subset. Consecutive subsets differ by the number of features given by the step parameter, which is 1 by default. Note that scikit-learn stores the entries in order of increasing number of features, so the first score belongs to the smallest subset and the last score is the one for all features. (Since scikit-learn 1.0 these per-subset scores are exposed through cv_results_, and grid_scores_ was removed in 1.2.)

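score(X, y) reduces X to the selected (optimal) set of features and returns the score of the final estimator, refit on those features, on the data you pass in. It is therefore a single score on that particular X and y, not an average over held-out folds, which is why it differs from the values in grid_scores_.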
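To make the difference concrete, here is a small sketch under assumed conditions: the LogisticRegression estimator and the synthetic data are placeholders, not taken from the question. It reads the per-subset CV scores (from cv_results_ on scikit-learn 1.0 or later, falling back to grid_scores_ on older versions) and shows that score(X, y) is the fitted estimator's score on the reduced data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data and estimator, assumed for illustration only
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5).fit(X, y)

# Mean CV score per feature subset; the attribute name depends on the version
if hasattr(selector, "cv_results_"):
    cv_scores = np.asarray(selector.cv_results_["mean_test_score"])
else:
    cv_scores = np.asarray(selector.grid_scores_)

# score(X, y) scores the final estimator on the data passed in, after feature reduction
full_score = selector.score(X, y)
manual = selector.estimator_.score(selector.transform(X), y)

print(len(cv_scores))       # one entry per candidate subset (8 here, with step=1)
print(selector.n_features_) # number of features selected
print(cv_scores.max(), full_score)
print(np.isclose(full_score, manual))  # True
```

Because score(X, y) is computed here on the same data used for fitting, it is usually more optimistic than the cross-validated values.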

Why are the scores sometimes negative?

This depends on the estimator and scorer you are using. If you have not set a scorer, RFECV uses the estimator's default score function. For classifiers this is generally accuracy, which cannot be negative. For regressors it is generally R², which is negative whenever the model does worse than predicting the mean, and scorers such as neg_mean_squared_error are negative by construction. In your particular case the scorer might be one of these.
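As an illustration of scorers that produce negative values, the sketch below uses cross_val_score rather than RFECV to keep it short; RFECV accepts the same scoring strings through its scoring parameter. The noise data and the LinearRegression estimator are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(60, 5))
y = rng.normal(size=60)  # target is pure noise, unrelated to X (illustrative only)

# Default scorer for a regressor is R^2, which can drop below zero on held-out data
r2 = cross_val_score(LinearRegression(), X, y, cv=5)

# "neg_*" scorers negate an error metric so that higher is better; values are never positive
neg_mse = cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error")

print(r2)
print(neg_mse)
```

With a "neg_" scorer, a value closer to zero means a smaller error, so the sign alone does not indicate a problem.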
