Difference between using train_test_split and cross_val_score in sklearn.cross_validation


Question

I have a matrix with 20 columns. The last column contains 0/1 labels.

The link to the data is here.

I am trying to run a random forest on the dataset, using cross-validation. I use two methods of doing this:


  1. using sklearn.cross_validation.cross_val_score
  2. using sklearn.cross_validation.train_test_split

I am getting different results when I do what I think is pretty much the same thing. To exemplify, I run a two-fold cross-validation using the two methods above, as in the code below.

import csv
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score

#read in the data
data = pd.read_csv('data_so.csv', header=None)
X = data.iloc[:,0:18]   # feature columns
y = data.iloc[:,19]     # last column: the 0/1 label

depth = 5
maxFeat = 3 

result = cross_val_score(
    ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                    max_features=maxFeat, oob_score=False),
    X, y, scoring='roc_auc', cv=2)

result
# result is now something like array([ 0.66773295,  0.58824739])

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)

RFModel = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth, max_features=maxFeat, oob_score=False)
RFModel.fit(xtrain,ytrain)
prediction = RFModel.predict_proba(xtest)
auc = roc_auc_score(ytest, prediction[:,1:2])
print(auc)    # something like 0.83

RFModel.fit(xtest,ytest)
prediction = RFModel.predict_proba(xtrain)
auc = roc_auc_score(ytrain, prediction[:,1:2])
print(auc)    # also something like 0.83

My question:

Why am I getting different results, i.e., why is the AUC (the metric I am using) higher when I use train_test_split?

Note: When I use more folds (say 10 folds), there appears to be some kind of pattern in my results, with the first calculation always giving me the highest AUC.

In the case of the two-fold cross-validation in the example above, the first AUC is always higher than the second one; it's always something like 0.70 and 0.58.

Thanks for your help!

Answer

When using cross_val_score, you'll frequently want to use a KFold or StratifiedKFold iterator:

http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics

http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold

By default, cross_val_score will not randomize your data, which can produce odd results like this if your data isn't random to begin with.

The KFold iterator has a random_state parameter (see the sketch after the link below):

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
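
For illustration, here is a minimal sketch (not part of the original answer) of passing an explicitly shuffled, stratified iterator to cross_val_score. It assumes X and y as defined in the question, and a scikit-learn version in which the old sklearn.cross_validation.StratifiedKFold still exists and accepts shuffle and random_state:

from sklearn import ensemble
from sklearn.cross_validation import StratifiedKFold, cross_val_score

# 2-fold stratified iterator that shuffles the rows before splitting;
# random_state only makes the shuffle reproducible (42 is arbitrary)
cv = StratifiedKFold(y, n_folds=2, shuffle=True, random_state=42)

rf = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=5,
                                     max_features=3, oob_score=False)

# passing the iterator instead of cv=2 means the folds are shuffled and
# stratified, rather than taken in the order the rows appear in the file
result = cross_val_score(rf, X, y, scoring='roc_auc', cv=cv)
print(result)

If the lack of shuffling really is the cause, the fixed 0.70 / 0.58 pattern between the two folds should disappear with this setup.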

train_test_split also takes a random_state parameter, and it does randomize by default:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
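
Along the same lines, a small illustrative sketch (again not from the original answer, parameter names follow the old sklearn.cross_validation.train_test_split signature): because train_test_split already shuffles, fixing random_state only makes that one random split, and hence the AUC, reproducible between runs:

from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split

# train_test_split shuffles by default; random_state pins the shuffle so
# the split (and the resulting AUC) is the same on every run
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50,
                                                random_state=42)

rf = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=5,
                                     max_features=3, oob_score=False)
rf.fit(xtrain, ytrain)
print(roc_auc_score(ytest, rf.predict_proba(xtest)[:, 1]))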

Patterns like the one you describe are usually the result of a lack of randomness in the train/test split.
