Does the training+testing set have to be different from the predicting set (so that you need to apply a time-shift to ALL columns)?

Question

I know the general rule that we should test a trained classifier only on the testing set.

But now comes the question: When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set? Or do I have to apply it to a new predicting set that is different from the training+testing set?

And what if I predict a label column of a time series? (Edited later: I do not mean a classical time series analysis here, just a broad selection of columns from a typical database: weekly, monthly or randomly stored data that I convert into separate feature columns, one per week / month / year ...) Do I have to shift all of the features of the training+testing set (not just the past columns of the time series label column, but also all other normal features) back to a point in time where the data has no "knowledge" overlap with the predicting set?

I would then train and test the classifier on features shifted to the past by n months, scoring against a label column that is unshifted and most recent, and then predict from the most recent, unshifted features. Shifted and unshifted features have the same number of columns; I align them by assigning the column names of the shifted features to the unshifted features.
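The scheme described above can be sketched roughly as follows. This is only an illustration under assumptions: the wide table `wide`, its `x_<month>` / `y_<month>` column naming, the helper `lagged_features`, the window width of 3 and the choice of `LogisticRegression` are all hypothetical and not part of the original question. Naming columns by lag (`x_lag1`, `y_lag1`, ...) is one way to make the shifted and unshifted feature sets line up without renaming afterwards.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

months = ["2023-01", "2023-02", "2023-03", "2023-04", "2023-05", "2023-06"]
n_rows = 200
rng = np.random.default_rng(0)

# Hypothetical wide table: one row per entity, one x_ and one y_ column per month.
wide = pd.DataFrame({f"x_{m}": rng.normal(size=n_rows) for m in months})
prev = months[0]
for m in months:
    noise = rng.normal(scale=0.5, size=n_rows)
    wide[f"y_{m}"] = (wide[f"x_{prev}"] + noise > 0).astype(int)
    prev = m


def lagged_features(wide, months, target_idx, width):
    """Features for the month at position target_idx, built only from the
    `width` months before it and named by lag so that windows align."""
    cols = {}
    for lag in range(1, width + 1):
        m = months[target_idx - lag]
        cols[f"x_lag{lag}"] = wide[f"x_{m}"]
        cols[f"y_lag{lag}"] = wide[f"y_{m}"]
    return pd.DataFrame(cols)


width = 3
last = len(months) - 1  # 2023-06, the most recent month with a known label

# Train and test on shifted features (2023-03 .. 2023-05) against the latest label.
X_shifted = lagged_features(wide, months, last, width)
y_latest = wide[f"y_{months[last]}"]
X_train, X_test, y_train, y_test = train_test_split(
    X_shifted, y_latest, test_size=0.25, random_state=0
)
clf = LogisticRegression().fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

# Predict the next, unknown month from the most recent features (2023-04 .. 2023-06).
X_latest = lagged_features(wide, months, last + 1, width)
predictions = clf.predict(X_latest)
```

Here the test split is taken across rows of the same shifted window; whether that is an adequate test for a given dataset is a separate question that this sketch does not settle.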

p.s.1: The general approach, from https://en.wikipedia.org/wiki/Dependent_and_independent_variables:

In data mining tools (for multivariate statistics and machine learning), the dependent variable is assigned a role as target variable (or in some tools as label attribute), while an independent variable may be assigned a role as regular variable.[8] Known values for the target variable are provided for the training data set and test data set, but should be predicted for other data.

p.s.2: In this basic tutorial we can see that the predicting set is made different: https://scikit-learn.org/stable/tutorial/basic/tutorial.html

We select the training set with the [:-1] Python syntax, which produces a new array that contains all but the last item from digits.data: […] Now you can predict new values. In this case, you’ll predict using the last image from digits.data[-1:]. By predicting, you’ll determine the image from the training set that best matches the last image.
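For reference, the split that the quoted tutorial passage describes looks roughly like this. The estimator and its parameters (`svm.SVC(gamma=0.001, C=100.0)`) are given as I recall them from that tutorial and should be treated as assumptions; the point is only that the predicted sample `digits.data[-1:]` is never seen during fitting.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()

# Fit on everything except the last image.
clf = svm.SVC(gamma=0.001, C=100.0)
clf.fit(digits.data[:-1], digits.target[:-1])

# Predict only the held-out last image, which was not part of the fit.
prediction = clf.predict(digits.data[-1:])
```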

Recommended answer

Answering myself after half a year here. The first answer was a slight misunderstanding about the term "time series" which I had caused with an unclear question (edited).

The question above, "When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set?", has a simple answer: no.

The question above, "Do I have to shift all of the features?", has a simple answer: yes.

In short, when I predict a month's class column, I have to shift all of the non-class columns back in time as well, in addition to the previous months of the class that I converted to features. All data must have been known before the month in which the class is predicted.
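If the data is kept in long format (one row per entity and month) instead of one column per month, the same rule can be sketched with pandas `groupby(...).shift(n)`. The table `long_df`, its column names (`spend`, `visits`, `cls`) and the lag `n = 1` are hypothetical and only serve the illustration.

```python
import pandas as pd

months = ["2023-01", "2023-02", "2023-03", "2023-04"]

# Hypothetical long table: one row per (entity, month).
long_df = pd.DataFrame({
    "entity": ["a"] * 4 + ["b"] * 4,
    "month": months * 2,
    "spend": [10, 12, 11, 15, 7, 8, 9, 6],
    "visits": [1, 2, 2, 3, 0, 1, 1, 0],
    "cls": [0, 0, 1, 1, 0, 0, 0, 1],
}).sort_values(["entity", "month"])

n = 1
# Lag ALL columns, the ordinary features as well as the class itself.
source_cols = ["spend", "visits", "cls"]
lagged = long_df.groupby("entity")[source_cols].shift(n).add_suffix(f"_lag{n}")

# Keep the target plus lagged columns only; same-month spend/visits are left out
# because they would not be known before the month being predicted.
model_table = pd.concat(
    [long_df[["entity", "month", "cls"]], lagged], axis=1
).dropna()
```

Rows for each entity's first month drop out because nothing earlier exists to build their features from.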
