Splitting data into train and test datasets using time-based splitting


Question

I know that train_test_split splits the data randomly, but I need to know how to split it based on time.

  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
  # this splits the data randomly into 67% train and 33% test

How can I split the same dataset based on time, with 67% for training and 33% for testing? The dataset has a column TimeStamp.

I tried searching through similar questions but was not sure about the approach.

Could someone explain this briefly?

Recommended answer

On time-series datasets, data is split differently: every test set must come later in time than the data used for training. See this link for more info. Alternatively, you can try TimeSeriesSplit from the scikit-learn package. The main idea is this: suppose you have 10 data points ordered by timestamp. The splits will then look like this:

Split 1:
Train_indices : 1
Test_indices  : 2

Split 2:
Train_indices : 1, 2
Test_indices  : 3

Split 3:
Train_indices : 1, 2, 3
Test_indices  : 4

Split 4:
Train_indices : 1, 2, 3, 4
Test_indices  : 5

And so on. You can check the example shown in the link above to get a better idea of how TimeSeriesSplit works in sklearn.
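As a rough sketch of the expanding-window pattern listed above, the snippet below runs TimeSeriesSplit on 10 time-ordered points. The value n_splits=9 is an assumption chosen here so that each test fold holds exactly one point; the answer does not state which setting the listing was based on. Indices are 0-based, so they are shifted by one relative to the listing.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 10 placeholder samples, assumed to be already ordered by timestamp
X = np.arange(10).reshape(-1, 1)

# Assumption: n_splits=9 gives one-point test folds on 10 samples
tscv = TimeSeriesSplit(n_splits=9)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

# Show the first four splits; the training window grows by one point each time
for i, (train_idx, test_idx) in enumerate(splits[:4], start=1):
    print(f"Split {i}: train={train_idx} test={test_idx}")
```

Each training window contains only points that precede its test point, which is the property that distinguishes this from a random split.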

Update: If you have a separate time column, you can simply sort the data by that column and then apply TimeSeriesSplit as described above to get the splits.
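A minimal sketch of the sort-then-split step, assuming the data lives in a pandas DataFrame. The TimeStamp column name comes from the question; the feature and target columns, the sample values, and n_splits=2 are hypothetical and only serve the illustration.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: rows are deliberately out of chronological order
df = pd.DataFrame({
    "TimeStamp": pd.to_datetime([
        "2021-01-03", "2021-01-01", "2021-01-06",
        "2021-01-02", "2021-01-05", "2021-01-04",
    ]),
    "feature": [30, 10, 60, 20, 50, 40],
    "target": [3, 1, 6, 2, 5, 4],
})

# Sort by time first so that positional indices follow chronological order
df_sorted = df.sort_values("TimeStamp").reset_index(drop=True)
X = df_sorted[["feature"]]
y = df_sorted["target"]

tscv = TimeSeriesSplit(n_splits=2)  # assumption: 2 splits, for a short demo
for train_index, test_index in tscv.split(X):
    # TimeSeriesSplit yields positional indices, hence .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print("TRAIN:", train_index, "TEST:", test_index)
```

Resetting the index after sorting keeps positions and labels aligned, so the rows selected by each split can be looked up in df_sorted directly.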

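The number of splits determines how large each test fold is. The setting below gives test folds of about 3 samples each, so it does not by itself produce a 67% / 33% split: in the 12-point example that follows, the final split has 9 training points and 3 test points (75% / 25%). For a final split close to 67% training and 33% testing, n_splits=2 makes each test fold about one third of the data.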

no_of_split = int((len(data)-3)/3)

Example

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4],
              [3, 4], [1, 2], [3, 4], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
tscv = TimeSeriesSplit(n_splits=int((len(y) - 3) / 3))
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

    # Use the indices to select the train and test rows
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Output:

TRAIN: [0 1 2] TEST: [3 4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
TRAIN: [0 1 2 3 4 5 6 7 8] TEST: [ 9 10 11]
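If only a single chronological split is needed, as in the original question, one possible sketch is to sort by time and call train_test_split with shuffle=False, which keeps the rows in order and takes the test set from the end. This is an alternative to the TimeSeriesSplit approach in the answer, not part of it; the timestamps and values below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 12 samples whose timestamps are out of order
timestamps = np.array([5, 2, 9, 0, 11, 3, 7, 1, 10, 4, 8, 6])
X = timestamps.reshape(-1, 1) * 10
y = timestamps * 100

# Put every array into chronological order
order = np.argsort(timestamps)
X_sorted, y_sorted, t_sorted = X[order], y[order], timestamps[order]

# shuffle=False preserves order: earliest rows train, latest rows test
X_train, X_test, y_train, y_test = train_test_split(
    X_sorted, y_sorted, test_size=0.33, shuffle=False
)

# Matching timestamps for each part, to check that no test row precedes training
t_train, t_test = t_sorted[:len(X_train)], t_sorted[len(X_train):]
print(len(X_train), "train rows,", len(X_test), "test rows")
print("latest train time:", t_train.max(), "earliest test time:", t_test.min())
```

Because the test rows are the most recent ones, the model is evaluated only on data that comes after everything it was trained on.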
