Scikit-Learn: Test Size in TimeSeriesSplit

Question

I am using scikit-learn's TimeSeriesSplit to split my data into training and test sets. Currently the test set makes up 50% of the data in the first split, around 30% in the next, and 25% in the one after that. I want a fixed 10% of the data to be used as the test set.

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    print(train_index, test_index)

The output is:

[   0    1    2 ..., 1067 1068 1069] [1070 1071 1072 ..., 2136 2137 2138]
[   0    1    2 ..., 2136 2137 2138] [2139 2140 2141 ..., 3205 3206 3207]
[   0    1    2 ..., 3205 3206 3207] [3208 3209 3210 ..., 4274 4275 4276]

I would like something like tscv = TimeSeriesSplit(n_splits=3, test_size=0.1), similar to train_test_split.

How can I set aside only 10% of the entries for testing?

Recommended answer

There is no direct parameter for specifying a percentage, but you can choose n_splits so that you get the desired result.

The documentation states:

In the kth split, it returns first k folds as train set and the (k+1)th fold as test set.

You want the last 10% as the test set and the rest as the training set, so use n_splits=9. This divides the data into 10 folds, and in the last iteration of the for loop the first 9 folds are returned as the training set and the last fold as the test set.

So change your code accordingly:

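from sklearn.model_selection import TimeSeriesSplit

test_size = 0.1

# Inverts the relation used in the source of TimeSeriesSplit (n_folds = n_splits + 1).
# round() is used instead of 1 // test_size, because 1 // 0.1 evaluates to 9.0, not 10.
n_splits = round(1 / test_size) - 1   # 9, an int as TimeSeriesSplit expects

tscv = TimeSeriesSplit(n_splits=n_splits)
for train_index, test_index in tscv.split(X):
    print(train_index, test_index)

    # See the notes below about where these assignments are placed.
    # Indexing like this assumes X and y are NumPy arrays.
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]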

If you keep the X_train, X_test, etc. assignments inside the for loop, the test set stays at about 10% of the data in every iteration, while the training set grows from split to split (because in a time series only values that precede the test indices can be used for training).

If you instead move them after the for loop, only the indices from the last iteration are used, giving a single train/test pair with 90% of the data for training and 10% for testing.
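As a minimal sketch of the single 90/10 case, the last split can also be taken directly instead of running the whole loop. The data here is dummy data invented for illustration (100 time-ordered samples); only TimeSeriesSplit and its split method come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Dummy data for illustration only: 100 time-ordered samples.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

tscv = TimeSeriesSplit(n_splits=9)

# Keep only the final split instead of iterating over all nine.
train_index, test_index = list(tscv.split(X))[-1]
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

print(len(X_train), len(X_test))
```

Every training index in this split precedes every test index, so the temporal order is preserved.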

EDIT: I can't say why they chose the (k+1)th fold as the test set; see the explanation in the scikit-learn user guide. In the source code, however, test_size is computed from n_splits:

n_samples = _num_samples(X)
n_splits = self.n_splits
n_folds = n_splits + 1
test_size = (n_samples // n_folds)
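To see how this arithmetic leads to the splits in the question, here is a rough pure-Python sketch. The helper name time_series_folds is hypothetical, and the start offset (leftover samples going to the first training set) is an assumption chosen so that the boundaries match the output printed in the question; it is not part of the quoted snippet.

```python
def time_series_folds(n_samples, n_splits):
    """Return (test_start, test_end) boundaries for each split.

    Hypothetical helper that mirrors the quoted arithmetic. The training
    set of each split is assumed to be everything before test_start.
    """
    n_folds = n_splits + 1
    test_size = n_samples // n_folds
    # Assumption: leftover samples (n_samples % n_folds) go to the first training set.
    first_test_start = test_size + n_samples % n_folds
    return [
        (start, start + test_size)
        for start in range(first_test_start, n_samples, test_size)
    ]


n_samples = 4277  # implied by the last index (4276) in the question's output
for test_start, test_end in time_series_folds(n_samples, n_splits=3):
    share = (test_end - test_start) / test_end
    print(f"train [0, {test_start})  test [{test_start}, {test_end})  test share {share:.0%}")
```

With n_splits=3 the test block is always 1069 samples, but its share of the data seen so far shrinks from split to split, which is the pattern described in the question. With n_splits=9 on 100 samples the last split covers indices 90 to 99.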

So a future version might expose test_size as a parameter. Hope this helps. Feel free to comment here if anything is unclear.
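For readers on a newer release: scikit-learn 0.24 and later do accept a test_size argument in TimeSeriesSplit, but it is a number of samples rather than a fraction. The sketch below assumes such a version is installed and converts the 10% by hand; the data is dummy data for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Dummy data for illustration only: 100 time-ordered samples.
X = np.arange(100).reshape(-1, 1)

# Assumes scikit-learn >= 0.24. test_size is a sample count, not a
# fraction, so the 10% is converted manually.
test_size = round(0.1 * len(X))
tscv = TimeSeriesSplit(n_splits=3, test_size=test_size)

split_sizes = [(len(train), len(test)) for train, test in tscv.split(X)]
print(split_sizes)
```

Each split then has a test set of 10 samples, while the training set grows by 10 samples per split.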
