How to split data into train and test sets based on time
Question
How can I split data into train and test sets using a time-based split?
I know that train_test_split splits the data randomly, but how do I split it based on time?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# this splits the data randomly into 67% train and 33% test
How do I split the same dataset based on time, with 67% train and 33% test?
The dataset has a TimeStamp column.
I searched for similar questions but was not sure which approach to use.
Can someone explain briefly?
Recommended answer
On time-series datasets, data is split differently: the test data must come later in time than the training data. See this link for more info. Alternatively, you can try TimeSeriesSplit from the scikit-learn package. The main idea is this: suppose you have 10 data points ordered by timestamp. The splits will then look like this:
Split 1 :
Train_indices : 1
Test_indices : 2
Split 2 :
Train_indices : 1, 2
Test_indices : 3
Split 3 :
Train_indices : 1, 2, 3
Test_indices : 4
Split 4 :
Train_indices : 1, 2, 3, 4
Test_indices : 5
And so on. You can check the example shown in the link above to get a better idea of how TimeSeriesSplit works in sklearn.
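The expanding-window pattern listed above can be reproduced with a minimal sketch like the following. It assumes ten samples that are already sorted by timestamp and uses n_splits=9, which (with default settings) gives test folds of one sample each. The array X is a hypothetical stand-in for real data, and the printed indices are 0-based rather than 1-based.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten observations, assumed to be already sorted by timestamp (hypothetical data).
X = np.arange(10).reshape(-1, 1)

# With 10 samples, n_splits=9 yields nine folds whose test set is a single sample.
tscv = TimeSeriesSplit(n_splits=9)

# Collect (train_indices, test_indices) pairs as plain lists for easy inspection.
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Split {i}: train={train_idx} test={test_idx}")
```

Each training set contains every sample before its test sample, so no fold is trained on data from the future.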
Update: If you have a separate time column, you can simply sort the data by that column and apply TimeSeriesSplit as described above to get the splits.
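If a single 67%/33% holdout is all that is needed, one possible sketch is to sort by the time column and call train_test_split with shuffle=False, which keeps row order instead of shuffling. The DataFrame below is hypothetical; only the TimeStamp column name comes from the question, while feature and target are made-up names for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 12 daily rows, deliberately stored in reverse time order.
df = pd.DataFrame({
    "TimeStamp": pd.date_range("2021-01-01", periods=12, freq="D")[::-1],
    "feature": list(range(12)),
    "target": list(range(100, 112)),
})

# Sort chronologically so that row position reflects time order.
df = df.sort_values("TimeStamp").reset_index(drop=True)

X = df[["feature"]]
y = df["target"]

# shuffle=False keeps the order: earliest rows go to train, latest rows to test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, shuffle=False
)

print(len(X_train), len(X_test))
# Every training timestamp precedes every test timestamp.
print(df.loc[X_train.index, "TimeStamp"].max() < df.loc[X_test.index, "TimeStamp"].min())
```

With 12 rows this puts the 8 earliest rows in the training set and the 4 latest in the test set.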
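The number of splits controls the size of the final split. With TimeSeriesSplit's default settings, each test fold holds len(data) // (n_splits + 1) samples, so n_splits=2 makes the final split roughly 67% training and 33% testing data. The formula below instead yields test folds of about three samples each, which in the 12-sample example that follows makes the final split 75% training and 25% testing:
no_of_split = int((len(data)-3)/3)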
Example
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [3, 4], [1, 2], [3, 4], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
tscv = TimeSeriesSplit(n_splits=int((len(y)-3)/3))
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    # Use the indices to select the rows of each split
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Output:
TRAIN: [0 1 2] TEST: [3 4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
TRAIN: [0 1 2 3 4 5 6 7 8] TEST: [ 9 10 11]
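The proportions of the final split can be checked directly. The sketch below assumes TimeSeriesSplit's default settings and the same 12 samples as the example; final_split_sizes is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered samples, matching the size of the example above.
X = np.arange(12).reshape(-1, 1)

def final_split_sizes(n_splits):
    """Return (n_train, n_test) for the last fold produced by TimeSeriesSplit."""
    train_idx, test_idx = list(TimeSeriesSplit(n_splits=n_splits).split(X))[-1]
    return len(train_idx), len(test_idx)

print(final_split_sizes(3))  # (9, 3): 75% train, 25% test
print(final_split_sizes(2))  # (8, 4): about 67% train, 33% test
```

So with 12 samples, three splits end on a 75%/25% split, while two splits end on the 67%/33% ratio asked for in the question.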