ROC curve with Leave-One-Out cross-validation in sklearn

Question

I want to plot the ROC curve of a classifier using leave-one-out cross-validation.

A similar question seems to have been asked before, but it received no answer.

In another question, the following was stated:

In order to obtain a meaningful ROC AUC with LeaveOneOut, you need to calculate probability estimates for each fold (each consisting of just one observation), then calculate the ROC AUC on the set of all these probability estimates.
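The pooling described in that quote can also be sketched with `cross_val_predict`, which returns one out-of-fold prediction per sample. This is only an illustration under the same setup as the code in this question (two iris classes, a linear `SVC` with `probability=True`); the variable names `probs` and `pooled_auc` are made up for the example and do not come from the quoted answer.

```python
from sklearn import datasets
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVC

# Keep only two of the three iris classes so the problem is binary.
iris = datasets.load_iris()
mask = iris.target != 2
X, y = iris.data[mask], iris.target[mask]

clf = SVC(kernel="linear", class_weight="balanced", probability=True, random_state=0)

# Each sample is predicted by a model fitted on all the other samples,
# so `probs` holds exactly one held-out probability per observation.
probs = cross_val_predict(clf, X, y, cv=LeaveOneOut(), method="predict_proba")[:, 1]

# The ROC curve and AUC are computed once, on the pooled predictions.
fpr, tpr, thresholds = roc_curve(y, probs)
pooled_auc = roc_auc_score(y, probs)
print(f"Pooled LOOCV AUC: {pooled_auc:.2f}")
```

A per-fold AUC is not meaningful here because every test fold contains a single sample and therefore a single class, which is why the scores are pooled before calling `roc_curve`.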

Additionally, the official scikit-learn website has a similar example, but it uses KFold cross-validation (https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py).

So for the leave-one-out case, I am thinking of gathering the predicted probabilities on the test sets (one sample at a time) and, once I have the predicted probabilities for all folds, computing and plotting the ROC curve from them.

Does this seem okay? I do not see any other way to achieve my goal.

Here is my code:

from sklearn.svm import SVC
import numpy as np, matplotlib.pyplot as plt,  pandas as pd
from sklearn.model_selection import cross_val_score,cross_val_predict,  KFold,  LeaveOneOut, StratifiedKFold
from sklearn.metrics import roc_curve, auc
from sklearn import datasets

# Import some data to play with
iris = datasets.load_iris()
X_svc = iris.data
y = iris.target
X_svc, y = X_svc[y != 2], y[y != 2]

clf = SVC(kernel='linear', class_weight='balanced', probability=True, random_state=0)
kf = LeaveOneOut()

all_y = []
all_probs=[]
for train, test in kf.split(X_svc, y):
    all_y.append(y[test])
    all_probs.append(clf.fit(X_svc[train], y[train]).predict_proba(X_svc[test])[:,1])
all_y = np.array(all_y)
all_probs = np.array(all_probs)

fpr, tpr, thresholds = roc_curve(all_y,all_probs)
roc_auc = auc(fpr, tpr)
plt.figure(1, figsize=(12,6))
plt.plot(fpr, tpr, lw=2, alpha=0.5, label='LOOCV ROC (AUC = %0.2f)' % (roc_auc))
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Chance level', alpha=.8)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.grid()
plt.show()

Recommended answer

I believe the code is correct, and so is the splitting. I've added a few lines to validate both the implementation and the results:

from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score,cross_val_predict,  KFold,  LeaveOneOut, StratifiedKFold
from sklearn.metrics import roc_curve, auc
from sklearn import datasets

# Import some data to play with
iris = datasets.load_iris()
X_svc = iris.data
y = iris.target
X_svc, y = X_svc[y != 2], y[y != 2]

clf = SVC(kernel='linear', class_weight='balanced', probability=True, random_state=0)
kf = LeaveOneOut()
if kf.get_n_splits(X_svc) == len(X_svc):
    print("They are the same length, splitting correct")
else:
    print("Something is wrong")
all_y = []
all_probs=[]
for train, test in kf.split(X_svc, y):
    all_y.append(y[test])
    all_probs.append(clf.fit(X_svc[train], y[train]).predict_proba(X_svc[test])[:,1])
all_y = np.array(all_y)
all_probs = np.array(all_probs)
#print(all_y) #For validation 
#print(all_probs) #For validation

fpr, tpr, thresholds = roc_curve(all_y,all_probs)
print(fpr, tpr, thresholds) #For validation
roc_auc = auc(fpr, tpr)
plt.figure(1, figsize=(12,6))
plt.plot(fpr, tpr, lw=2, alpha=0.5, label='LOOCV ROC (AUC = %0.2f)' % (roc_auc))
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Chance level', alpha=.8)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.grid()
plt.show()

The if statement only checks that the data is split n times, where n is the number of observations in the dataset. This is because, as the documentation states, LeaveOneOut() is equivalent to KFold(n_splits=n) and LeavePOut(p=1). Also, when I printed the predicted probabilities they looked reasonable and were consistent with the curve. Congrats on your 1.00 AUC!
