scikit-learn - ROC curve with confidence intervals

Question

I am able to get a ROC curve using scikit-learn with fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred, pos_label=1), where y_true is a list of values based on my gold standard (i.e., 0 for negative and 1 for positive cases) and y_pred is a corresponding list of scores (e.g., 0.053497243, 0.008521122, 0.022781548, 0.101885263, 0.012913795, 0.0, 0.042881547 [...])

I am trying to figure out how to add confidence intervals to that curve, but I didn't find any easy way to do that with sklearn.

Recommended answer

You can bootstrap the ROC computations (sample with replacement new versions of y_true / y_pred out of the original y_true / y_pred and recompute a new value for roc_curve each time) and then estimate a confidence interval this way.

To take the variability induced by the train / test split into account, you can also use the ShuffleSplit CV iterator many times, fit a model on each train split, and generate y_pred for each model. This gives you an empirical distribution of roc_curves, from which you can compute confidence intervals.
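As a rough sketch of the ShuffleSplit idea, the snippet below refits a model on many random splits and collects the ROC AUC of each one. The synthetic dataset, the LogisticRegression model, the number of splits and the 95% percentile bounds are all placeholder assumptions, not part of the original answer. The resulting range describes the spread across splits rather than a formal confidence interval.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ShuffleSplit

# Placeholder data and model (assumptions for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

cv = ShuffleSplit(n_splits=50, test_size=0.25, random_state=0)
auc_scores = []
for train_idx, test_idx in cv.split(X):
    # Train a fresh model on each random train split
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Score the held-out split with the positive-class probability
    y_score = model.predict_proba(X[test_idx])[:, 1]
    auc_scores.append(roc_auc_score(y[test_idx], y_score))

auc_scores = np.array(auc_scores)
# Empirical 2.5th and 97.5th percentiles of the AUC over the splits
ci_lower, ci_upper = np.percentile(auc_scores, [2.5, 97.5])
print("Mean ROC area: {:0.3f} [{:0.3f} - {:0.3f}]".format(
    auc_scores.mean(), ci_lower, ci_upper))
```

With strongly imbalanced labels, StratifiedShuffleSplit may be a safer choice so that every test split contains both classes.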

Edit: bootstrapping in Python

Here is an example of bootstrapping the ROC AUC score out of the predictions of a single model. I chose to bootstrap the ROC AUC to make it easier to follow as a Stack Overflow answer, but it can be adapted to bootstrap the whole curve instead:

import numpy as np
from sklearn.metrics import roc_auc_score

y_pred = np.array([0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04])
y_true = np.array([0,    1,    0,    0,    1,    1,    0,    1,    0   ])

print("Original ROC area: {:0.3f}".format(roc_auc_score(y_true, y_pred)))

n_bootstraps = 1000
rng_seed = 42  # control reproducibility
bootstrapped_scores = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    # bootstrap by sampling with replacement on the prediction indices
    indices = rng.randint(0, len(y_pred), len(y_pred))
    if len(np.unique(y_true[indices])) < 2:
        # We need at least one positive and one negative sample for ROC AUC
        # to be defined: reject the sample
        continue

    score = roc_auc_score(y_true[indices], y_pred[indices])
    bootstrapped_scores.append(score)
    print("Bootstrap #{} ROC area: {:0.3f}".format(i + 1, score))

You can see that we need to reject some invalid resamples. However, on real data with many predictions this is a very rare event and should not impact the confidence interval significantly (you can try to vary the rng_seed to check).
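The same resampling loop can be adapted to the whole curve, which is what the question asks for. One possible sketch, not the answer's own code: compute roc_curve on each resample, interpolate the TPR onto a shared FPR grid, and take pointwise percentiles. The grid size (101 points), the variable names fpr_grid, tpr_samples, tpr_lower and tpr_upper, and the 90% band are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_pred = np.array([0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04])
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0])

# Shared FPR grid so that curves from different resamples are comparable
fpr_grid = np.linspace(0.0, 1.0, 101)

rng = np.random.RandomState(42)
tpr_samples = []
for _ in range(1000):
    indices = rng.randint(0, len(y_pred), len(y_pred))
    if len(np.unique(y_true[indices])) < 2:
        # The ROC curve is undefined without both classes: reject the sample
        continue
    fpr, tpr, _ = roc_curve(y_true[indices], y_pred[indices])
    # Interpolate this resample's TPR onto the shared grid
    tpr_interp = np.interp(fpr_grid, fpr, tpr)
    tpr_interp[0] = 0.0  # force the curve to start at the origin
    tpr_samples.append(tpr_interp)

tpr_samples = np.array(tpr_samples)
# Pointwise 5th and 95th percentiles give a 90% band around the curve
tpr_lower, tpr_upper = np.percentile(tpr_samples, [5, 95], axis=0)
```

The band can then be drawn with plt.fill_between(fpr_grid, tpr_lower, tpr_upper, alpha=0.3) on top of the original curve. Note that this is a pointwise band: it bounds the TPR at each FPR value separately, not the curve as a whole.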

Here is the histogram:

import matplotlib.pyplot as plt
plt.hist(bootstrapped_scores, bins=50)
plt.title('Histogram of the bootstrapped ROC AUC scores')
plt.show()

Note that the resampled scores are bounded by the [0, 1] range, which causes a high number of scores to pile up in the last bin.

To get a confidence interval one can sort the samples:

sorted_scores = np.array(bootstrapped_scores)
sorted_scores.sort()

# Computing the lower and upper bound of the 90% confidence interval
# You can change the bounds percentiles to 0.025 and 0.975 to get
# a 95% confidence interval instead.
confidence_lower = sorted_scores[int(0.05 * len(sorted_scores))]
confidence_upper = sorted_scores[int(0.95 * len(sorted_scores))]
print("Confidence interval for the score: [{:0.3f} - {:0.3f}]".format(
    confidence_lower, confidence_upper))

This gives:

Confidence interval for the score: [0.444 - 1.000]

The confidence interval is very wide, but this is probably a consequence of my choice of predictions (3 mistakes out of 9 predictions) and of the total number of predictions being quite small.
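As an alternative to sorting and indexing by hand, the bounds can be read off with np.percentile. The helper below is a hypothetical wrapper (the name bootstrap_auc_ci and its parameters are assumptions, not a scikit-learn API) around the same resampling loop. Because np.percentile interpolates between order statistics by default, its bounds may differ slightly from the index-based ones.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_ci(y_true, y_pred, n_bootstraps=1000, alpha=0.10, seed=42):
    """Percentile bootstrap interval for ROC AUC (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_bootstraps):
        indices = rng.randint(0, len(y_pred), len(y_pred))
        if len(np.unique(y_true[indices])) < 2:
            # ROC AUC needs both classes: reject this resample
            continue
        scores.append(roc_auc_score(y_true[indices], y_pred[indices]))
    # alpha=0.10 yields the 5th and 95th percentiles (a 90% interval)
    lower, upper = np.percentile(
        scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


y_pred = np.array([0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04])
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0])
lower, upper = bootstrap_auc_ci(y_true, y_pred)
print("Confidence interval for the score: [{:0.3f} - {:0.3f}]".format(
    lower, upper))
```

Passing alpha=0.05 gives a 95% interval instead.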

Another remark on the plot: the scores are quantized (many empty histogram bins). This is a consequence of the small number of predictions. One could introduce a bit of Gaussian noise on the scores (or the y_pred values) to smooth the distribution and make the histogram look better. But then the choice of the smoothing bandwidth is tricky.

Finally, as stated earlier, this confidence interval is specific to your training set. To get a better estimate of the variability of the ROC induced by your model class and parameters, you should do iterated cross-validation instead. However, this is often much more costly, as you need to train a new model for each random train / test split.
