Stratified cross-validation of time series data
Question
I want to do time series cross-validation based on group (the grp column). In the sample data below, temperature is my target variable.
import numpy as np
import pandas as pd
timeS=pd.date_range(start='1980-01-01 00:00:00', end='1980-01-01 00:00:05',
freq='S')
df = pd.DataFrame(dict(time=timeS, grp=['A']*3 + ['B']*3, material=[1,2,3]*2,
temperature=['2.4','5','9.9']*2))
  grp  material temperature                time
0   A         1         2.4 1980-01-01 00:00:00
1   A         2           5 1980-01-01 00:00:01
2   A         3         9.9 1980-01-01 00:00:02
3   B         1         2.4 1980-01-01 00:00:03
4   B         2           5 1980-01-01 00:00:04
5   B         3         9.9 1980-01-01 00:00:05
I am planning to add some lag features per group using this code:
df.groupby("grp")['temperature'].shift(-1)
0 5
1 9.9
2 NaN
3 5
4 9.9
5 NaN
Name: temperature, dtype: object
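As the output above shows, shift(-1) pulls the next row's value within each group (a lead), while shift(1) would pull the previous one (a lag). A minimal sketch of both, assuming the temperature column is first converted to numbers; the column names temp_lag1 and temp_lead1 are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "grp": ["A"] * 3 + ["B"] * 3,
    "temperature": [2.4, 5.0, 9.9] * 2,  # numeric here, not strings
})

by_grp = df.groupby("grp")["temperature"]
# shift(1): value from the previous row of the same group (lag)
df["temp_lag1"] = by_grp.shift(1)
# shift(-1): value from the next row of the same group (lead)
df["temp_lead1"] = by_grp.shift(-1)
print(df)
```

Because the shift is applied per group, the first row of each group has no lag and the last row has no lead, so values never leak from group A into group B.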
The problem I have now is that when I do cross-validation I can use sklearn.model_selection.TimeSeriesSplit from sklearn, but it does not take the group effect into account. Can anyone tell me how to do the CV split per group (like a stratified split)? I am going to use xgboost.cv for cross-validation, if that helps.
Time changes per group. Within each group, time increases uniformly (once per second).
Recommended answer
Use walk-forward validation, growing the training window by one observation per step:
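import pandas as pd

# Series.from_csv was removed in pandas 1.0; read_csv plus squeeze replaces it
series = pd.read_csv('yourfile.csv', header=0, index_col=0).squeeze('columns')
X = series.values
n_train = 500  # size of the initial training window
n_records = len(X)
# Walk forward: train on everything before i, test on the single observation at i
for i in range(n_train, n_records):
    train, test = X[0:i], X[i:i+1]
    print('train=%d, test=%d' % (len(train), len(test)))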
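The loop above treats the data as one series and does not look at grp. One possible way to respect the groups, sketched here under assumptions and not taken from the answer, is to run TimeSeriesSplit inside each group and merge the k-th train/test parts of all groups into fold k. The helper name grouped_time_series_splits is hypothetical, and the sketch assumes every group has enough rows for the requested number of splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit


def grouped_time_series_splits(df, group_col, time_col, n_splits=2):
    """Yield (train_idx, test_idx) arrays of row positions, one pair per fold.

    Hypothetical helper: each group is split on its own with TimeSeriesSplit,
    then the k-th parts of all groups are concatenated into fold k.
    """
    tscv = TimeSeriesSplit(n_splits=n_splits)
    # Remember each row's position, then order rows by time
    ordered = df.assign(_pos=np.arange(len(df))).sort_values(time_col)
    per_group = []
    for _, grp_df in ordered.groupby(group_col):
        pos = grp_df["_pos"].to_numpy()
        per_group.append([(pos[tr], pos[te]) for tr, te in tscv.split(pos)])
    # zip(*per_group) lines up fold k of every group
    for fold in zip(*per_group):
        train_idx = np.concatenate([tr for tr, _ in fold])
        test_idx = np.concatenate([te for _, te in fold])
        yield train_idx, test_idx


# Sample data shaped like the question's frame (temperature kept numeric)
times = pd.to_datetime("1980-01-01") + pd.to_timedelta(np.arange(6), unit="s")
df = pd.DataFrame({
    "time": times,
    "grp": ["A"] * 3 + ["B"] * 3,
    "material": [1, 2, 3] * 2,
    "temperature": [2.4, 5.0, 9.9] * 2,
})

folds = list(grouped_time_series_splits(df, "grp", "time", n_splits=2))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Within each group the training rows always precede the test rows, and every group contributes to both sides of every fold. As far as I know, xgboost.cv accepts a folds argument that can be a list of (train, test) index pairs, so the folds list could be passed there; check the documentation of your xgboost version before relying on that.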