Should Feature Selection be done before Train-Test Split or after?


Question

Actually, there are two contradictory facts that are possible answers to the question:

  1. The conventional answer is to do it after splitting, since there can be information leakage from the test set if it is done before.

  2. The contradicting answer is that, if only the training set chosen from the whole dataset is used for feature selection, then the feature selection or the feature importance score order is likely to change with the random_state of the Train_Test_Split. And if the feature selection for any particular work changes, then no generalization of feature importance can be done, which is not desirable. Secondly, if only the training set is used for feature selection, the test set may contain certain instances that defy/contradict the feature selection done on the training set alone, since the overall historical data is not analyzed. Moreover, feature importance scores can only be evaluated given a set of instances, rather than a single test/unknown instance.
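For what it's worth, the variability described in #2 is easy to observe. Below is a toy sketch in plain Python; the data and the mean-difference scoring function are invented purely for illustration, standing in for whatever real feature-selection method is used. Ranking features on training subsets drawn with different seeds can indeed yield different orders:

```python
import random

random.seed(0)

# Invented toy data: a weak class signal in the higher-numbered features.
n, p = 60, 4
y = [random.randint(0, 1) for _ in range(n)]
X = [[random.gauss(0.1 * j * y[i], 1.0) for j in range(p)] for i in range(n)]

def rank_features(X, y, idx):
    """Rank features by |difference of class means| over the rows in idx."""
    def score(j):
        a = [X[i][j] for i in idx if y[i] == 0]
        b = [X[i][j] for i in idx if y[i] == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(X[0])), key=score, reverse=True)

# Two values of the split seed -> two different training subsets,
# and (often) two different feature rankings.
for seed in (1, 2):
    train_idx = random.Random(seed).sample(range(n), int(0.7 * n))
    print(seed, rank_features(X, y, train_idx))
```

Whether that variability justifies scoring features on the full dataset is exactly what the answer addresses.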

Answer

The conventional answer #1 is correct here; the arguments in the contradicting answer #2 do not actually hold.

When having such doubts, it is useful to imagine that you simply do not have access to any test set during the model fitting process (which includes feature importance); you should treat the test set as literally unseen data (and, since unseen, it could not have been used for feature importance scores).

Hastie & Tibshirani have clearly argued long ago about the correct & wrong way to perform such processes; I have summarized the issue in a blog post, How NOT to perform feature selection! - and although the discussion is about cross-validation, it can be easily seen that the arguments hold for the case of train/test split, too.

The only argument that actually holds in your contradicting answer #2 is that

the overall historical data is not analyzed

Nevertheless, this is the necessary price to pay in order to have an independent test set for performance assessment; otherwise, with the same logic, we should use the test set for training, too, shouldn't we?

Wrap up: the test set is there solely for performance assessment of your model, and it should not be used in any stage of model building, including feature selection.

UPDATE (after comments):

the trends in the Test Set may be different

A standard (but often implicit) assumption here is that the training & test sets are qualitatively similar; it is exactly due to this assumption that we feel OK to just use simple random splits to get them. If we have reasons to believe that our data change in significant ways (not only between train & test, but during model deployment, too), the whole rationale breaks down, and completely different approaches are required.

Also, on doing so, there can be a high probability of Over-fitting

The only certain way of overfitting is to use the test set in any way during the pipeline (including for feature selection, as you suggest). Arguably, the linked blog post has enough arguments (including quotes & links) to be convincing. Classic example, the testimony in The Dangers of Overfitting or How to Drop 50 spots in 1 minute:

as the competition went on, I began to use much more feature selection and preprocessing. However, I made the classic mistake in my cross-validation method by not including this in the cross-validation folds (for more on this mistake, see this short description or section 7.10.2 in The Elements of Statistical Learning). This led to increasingly optimistic cross-validation estimates.

As I have already said, although the discussion here is about cross-validation, it should not be difficult to convince yourself that it perfectly applies to the train/test case, too.

feature selection should be done in such a way that Model Performance is enhanced

Well, nobody can argue with this, of course! The catch is - which exact performance are we talking about? Because the Kaggler quoted above was indeed getting better "performance" as he was going along (applying a mistaken procedure), until his model was faced with real unseen data (the moment of truth!), and it unsurprisingly flopped.

Admittedly, this is not trivial stuff, and it may take some time to internalize (it's no coincidence that, as Hastie & Tibshirani demonstrate, there are even research papers where the procedure is performed wrongly). Until then, my advice to keep you safe is: during all stages of model building (including feature selection), pretend that you don't have access to the test set at all, and that it becomes available only when you need to assess the performance of your final model.
