Thoughts about train_test_split for machine learning


Problem Description

I just noticed that many people tend to use train_test_split even before handling missing data; it seems they split the data at the very beginning.

There are also quite a few people who split the data right before the model-building step, after they have finished all the data cleaning, feature engineering, and feature selection.

The people who split the data at the very beginning say it is to prevent data leakage.

I am now quite confused about the pipeline for building a model. Why do we need to split the data at the very beginning, and clean the train set and test set separately, when we could do all the data cleaning and feature engineering (things like transforming categorical variables into dummy variables) on the whole dataset for convenience?

Please help me with this. I really want to know a convenient and scientifically sound pipeline.

Recommended Answer

You should split the data as early as possible.

Simply put, your data engineering pipeline builds models too.

Consider the simple idea of filling in missing values. To do this you need to "train" a mini-model that produces the mean, the mode, or some other average to use. Then you use this model to "predict" the missing values.
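
As an illustration, here is a minimal sketch of that "mini-model" idea using scikit-learn's SimpleImputer; the toy frames and the "age" column are hypothetical, purely for demonstration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test frames with a numeric "age" column containing gaps
train = pd.DataFrame({"age": [25, 30, np.nan, 40]})
test = pd.DataFrame({"age": [np.nan, 35]})

# "Train" the mini-model: it simply memorizes the mean of the training column
imputer = SimpleImputer(strategy="mean")
imputer.fit(train[["age"]])  # the mean is computed from the training rows only

# "Predict" the missing values in both sets using that training-set mean
train[["age"]] = imputer.transform(train[["age"]])
test[["age"]] = imputer.transform(test[["age"]])
```

The key point is that fit() only ever sees the training rows; the test rows are filled with a value learned elsewhere, just like any other prediction.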

If you include the test data in the training process for these mini-models, then you are letting the training process peek at that data and cheat a little bit. When it fills in the missing data with values built using the test data, it leaves little hints about what the test set looks like. This is what "data leakage" means in practice. In an ideal world you could ignore it, simply use all the data for training, and use the training score to decide which model is best.

But that won't work, because in practice a model is only useful once it can predict new data, not just the data that was available at training time. Google Translate needs to work on whatever you and I type in today, not just on what it was trained with earlier.

So, to ensure that the model will continue to work well when that happens, you should test it on some new data in a more controlled way. Using a test set that has been split out as early as possible and then hidden away is the standard way to do that.
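
In practice that means calling train_test_split right after loading the raw data. A minimal sketch, with a made-up toy frame standing in for the raw dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny made-up raw dataset that still contains missing values and text categories
df = pd.DataFrame({
    "age": [25, 30, np.nan, 40, 52, np.nan],
    "city": ["NY", "LA", "NY", np.nan, "SF", "LA"],
    "bought": [0, 1, 0, 1, 1, 0],
})

# Split first, before any cleaning, so the test rows never influence the
# statistics (means, modes, category lists) learned by later steps
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="bought"), df["bought"],
    test_size=0.33, random_state=42,
)
```

Everything that gets "fitted" from here on (imputation values, encoders, scalers, the model itself) should be fitted on X_train only, with X_test kept aside for the final evaluation.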

Yes, it means some inconvenience in splitting the data engineering up for training vs. testing. But many tools, such as scikit-learn, which separates the fit and transform stages, make it convenient to build an end-to-end data engineering and modeling pipeline with the proper train/test separation.
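
Continuing the hypothetical X_train/X_test split sketched above, here is one way such an end-to-end pipeline could look: every preprocessing step and the model itself are fitted on the training split in a single fit() call, and then only applied (never refitted) to the test split:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Impute and encode each column type appropriately (column names are the
# hypothetical ones from the split example above)
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["age"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression()),
])

# fit() learns the means, modes, dummy columns, and model weights from the
# training split only; score() merely applies them to the test split
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Because the preprocessing lives inside the pipeline, there is no extra bookkeeping: the train/test separation is enforced automatically every time you call fit() and predict()/score().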

