sklearn TimeSeriesSplit: cross_val_predict only works for partitions


Question

I am trying to use the TimeSeriesSplit cross-validation strategy in sklearn version 0.18.1 with a LogisticRegression estimator. I get an error stating:

cross_val_predict only works for partitions

The following code snippet reproduces it:

from sklearn import linear_model, neighbors
from sklearn.model_selection import train_test_split, cross_val_predict, TimeSeriesSplit, KFold, cross_val_score
import pandas as pd
import numpy as np
from datetime import date, datetime

df = pd.DataFrame(data=np.random.randint(0,10,(100,5)), index=pd.date_range(start=date.today(), periods=100), columns='x1 x2 x3 x4 y'.split())


X, y = df['x1 x2 x3 x4'.split()], df['y']
score = cross_val_score(linear_model.LogisticRegression(fit_intercept=True), X, y, cv=TimeSeriesSplit(n_splits=2))
y_hat = cross_val_predict(linear_model.LogisticRegression(fit_intercept=True), X, y, cv=TimeSeriesSplit(n_splits=2), method='predict_proba')

What am I doing wrong?

Recommended answer

There are several ways to pass the cv argument to cross_val_score: an integer, a splitter object, or an iterable of (train, test) index arrays. Here you have to pass the generator for the splits. For example:

y = range(14)
cv = TimeSeriesSplit(n_splits=2).split(y)

This gives a generator that yields the train and test index arrays for each CV split. The first pair looks like this:

print(next(cv))
    (array([0, 1, 2, 3, 4, 5]), array([6, 7, 8, 9]))
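
To see every split rather than only the first, the generator can be consumed in a loop. The sketch below is an illustration, not part of the original answer: the 14-sample array and the variable names are assumptions. It also collects which indices ever land in a test fold, which hints at why the error message talks about partitions: the earliest samples are only ever used for training.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 14  # illustrative size, matching the range(14) example
X = np.arange(n_samples).reshape(-1, 1)

# Materialise all (train, test) index pairs produced by the splitter.
splits = list(TimeSeriesSplit(n_splits=2).split(X))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)

# Indices that appear in at least one test fold.
tested = np.concatenate([test_idx for _, test_idx in splits])

# Indices that never appear in any test fold.
never_tested = np.setdiff1d(np.arange(n_samples), tested)
print("never in a test fold:", never_tested)
```

Because the test folds do not cover every sample, they do not form a partition of the data set.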

You can also pass a DataFrame as the input to split.

df = pd.DataFrame(data=np.random.randint(0,10,(100,5)), 
                  index=pd.date_range(start=date.today(), 
                  periods=100), columns='x1 x2 x3 x4 y'.split())

cv = TimeSeriesSplit(n_splits=2).split(df)
print(next(cv))
    (array([ 0,  1,  2, ..., 31, 32, 33]), array([34, 35, 36, ..., 64, 65, 66]))

In your case, this should work for cross_val_score:

score = cross_val_score(linear_model.LogisticRegression(fit_intercept=True), 
                         X, y, cv=TimeSeriesSplit(n_splits=2).split(df))

See the documentation for cross_val_score and TimeSeriesSplit for details.
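
Note that cross_val_predict is a different matter. It checks that the test folds together cover every sample exactly once, and TimeSeriesSplit never places the earliest samples in a test fold, so the error can be expected however cv is passed. One possible workaround, sketched here under assumptions and not taken from the answer above, is to loop over the splits by hand and store the predicted probabilities for the samples that do get tested. The random data, the seed, the max_iter value, and all variable names are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

# Synthetic data shaped like the question's example (assumption: 4 features, integer labels).
rng = np.random.RandomState(0)
X = rng.randint(0, 10, size=(100, 4)).astype(float)
y = rng.randint(0, 10, size=100)

classes = np.unique(y)
# Rows that are never part of a test fold stay NaN.
proba = np.full((len(y), len(classes)), np.nan)

for train_idx, test_idx in TimeSeriesSplit(n_splits=2).split(X):
    model = LogisticRegression(fit_intercept=True, max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # A training fold may lack some labels, so map the model's classes
    # onto the columns of the full class list.
    cols = np.searchsorted(classes, model.classes_)
    proba[test_idx] = 0.0
    proba[np.ix_(test_idx, cols)] = model.predict_proba(X[test_idx])

# Keep only the samples that received an out-of-fold prediction.
has_prediction = ~np.isnan(proba).any(axis=1)
print(has_prediction.sum(), "of", len(y), "samples have predictions")
```

The rows left as NaN correspond to the initial training block, for which no out-of-fold prediction exists under a time-ordered split.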
