How to compare ROC AUC scores of different binary classifiers and assess statistical significance in Python? (p-value, confidence interval)


Problem description

I would like to compare different binary classifiers in Python. For that, I want to calculate the ROC AUC scores, measure the 95% confidence interval (CI), and compute the p-value to assess statistical significance.

Below is a minimal example in scikit-learn which trains three different models on a binary classification dataset, plots the ROC curves and calculates the AUC scores.

Here are my specific questions:

  1. How to calculate the 95% confidence interval (CI) of the ROC AUC scores on the test set? (e.g. with bootstrapping)
  2. How to compare the AUC scores (on the test set) and measure the p-value to assess statistical significance? (The null hypothesis is that the models are not different. Rejecting the null hypothesis means the difference in AUC scores is statistically significant.)


import numpy as np

np.random.seed(2018)

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import matplotlib
import matplotlib.pyplot as plt

data = load_breast_cancer()

X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=17)

# Naive Bayes Classifier
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
nb_prediction_proba = nb_clf.predict_proba(X_test)[:, 1]

# Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=20)
rf_clf.fit(X_train, y_train)
rf_prediction_proba = rf_clf.predict_proba(X_test)[:, 1]

# Multi-layer Perceptron Classifier
mlp_clf = MLPClassifier(alpha=1, hidden_layer_sizes=150)
mlp_clf.fit(X_train, y_train)
mlp_prediction_proba = mlp_clf.predict_proba(X_test)[:, 1]


def roc_curve_and_score(y_test, pred_proba):
    fpr, tpr, _ = roc_curve(y_test.ravel(), pred_proba.ravel())
    roc_auc = roc_auc_score(y_test.ravel(), pred_proba.ravel())
    return fpr, tpr, roc_auc


plt.figure(figsize=(8, 6))
matplotlib.rcParams.update({'font.size': 14})
plt.grid()
fpr, tpr, roc_auc = roc_curve_and_score(y_test, rf_prediction_proba)
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, nb_prediction_proba)
plt.plot(fpr, tpr, color='green', lw=2,
         label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, mlp_prediction_proba)
plt.plot(fpr, tpr, color='crimson', lw=2,
         label='ROC AUC={0:.3f}'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.legend(loc="lower right")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity')
plt.ylabel('Sensitivity')
plt.show()

Recommended answer

Bootstrap for 95% confidence interval

You want to repeat your analysis on multiple resamplings of your data. In the general case, assume you have a function f(x) that determines whatever statistic you need from data x; you can then bootstrap like this:

def bootstrap(x, f, nsamples=1000):
    stats = [f(x[np.random.randint(x.shape[0], size=x.shape[0])]) for _ in range(nsamples)]
    return np.percentile(stats, (2.5, 97.5))

This gives you so-called plug-in estimates of the 95% confidence interval (i.e. you simply take the percentiles of the bootstrap distribution).
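
For illustration, the generic helper can give such a plug-in interval for one already-trained model by resampling only the test set (a rough sketch that reuses y_test and rf_prediction_proba from the question's code; the model itself is not refit here):

# Sketch: plug-in 95% CI of the random forest's test-set AUC, resampling
# test observations with replacement while keeping the trained model fixed.
test_idx = np.arange(len(y_test))
ci_lower, ci_upper = bootstrap(
    test_idx,
    lambda idx: roc_auc_score(y_test[idx], rf_prediction_proba[idx]))
print('Random Forest test-set AUC 95% CI: [{:.3f}, {:.3f}]'.format(ci_lower, ci_upper))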

In your case, you can write a more specific function like this:

def bootstrap_auc(clf, X_train, y_train, X_test, y_test, nsamples=1000):
    auc_values = []
    for b in range(nsamples):
        idx = np.random.randint(X_train.shape[0], size=X_train.shape[0])
        clf.fit(X_train[idx], y_train[idx])
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
        auc_values.append(roc_auc)
    return np.percentile(auc_values, (2.5, 97.5))

Here, clf is the classifier whose performance you want to test, and X_train, y_train, X_test, y_test are as in your code.
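
For instance, it could be called for the three models from the question roughly like this (a sketch; each call refits the classifier nsamples times, so it takes a while):

# Sketch: 95% bootstrap CIs for the three classifiers from the question.
for name, clf in [('Naive Bayes', nb_clf),
                  ('Random Forest', rf_clf),
                  ('Multi-layer Perceptron', mlp_clf)]:
    lower, upper = bootstrap_auc(clf, X_train, y_train, X_test, y_test)
    print('{}: 95% CI [{:.3f}, {:.3f}]'.format(name, lower, upper))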

This gives me the following confidence intervals (rounded to three digits, 1000 bootstrap samples):

  • Naive Bayes: 0.986 [0.980, 0.988] (estimated lower and upper bounds of the confidence interval)
  • Random Forest: 0.983 [0.974, 0.989]
  • Multi-layer Perceptron: 0.974 [0.223, 0.98]

To test a single classifier against chance level, a permutation test would technically go over all permutations of your observation sequence and evaluate the ROC curve with the permuted target values (the features are not permuted). This is fine if you have only a few observations, but it becomes very costly with more observations. It is therefore common to subsample and simply do a number of random permutations. Here, the implementation depends a bit more on the specific thing you want to test. The following function does that for your ROC AUC values:

def permutation_test(clf, X_train, y_train, X_test, y_test, nsamples=1000):
    idx1 = np.arange(X_train.shape[0])
    idx2 = np.arange(X_test.shape[0])
    auc_values = np.empty(nsamples)
    for b in range(nsamples):
        np.random.shuffle(idx1)  # Shuffles in-place
        np.random.shuffle(idx2)
        clf.fit(X_train, y_train[idx1])
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test[idx2].ravel(), pred.ravel())
        auc_values[b] = roc_auc
    clf.fit(X_train, y_train)
    pred = clf.predict_proba(X_test)[:, 1]
    roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
    return roc_auc, np.mean(auc_values >= roc_auc)

This function again takes your classifier as clf and returns the AUC value on the unshuffled data together with the p-value (i.e. the probability of observing an AUC value larger than or equal to the one obtained on the unshuffled data).
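
A call might look like this (again just a sketch, using the random forest from the question):

# Sketch: permutation test of the random forest against chance level.
rf_auc, rf_p = permutation_test(rf_clf, X_train, y_train, X_test, y_test)
print('Random Forest: AUC={:.3f}, p-value vs. chance={:.3f}'.format(rf_auc, rf_p))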

Running this with 1000 samples gives a p-value of 0 for all three classifiers. Note that these p-values are not exact because of the sampling, but they indicate that all of these classifiers perform better than chance.

Comparing two classifiers against each other is much easier. Given two classifiers, you have a prediction for every observation; you just shuffle the assignment between predictions and classifiers like this:

def permutation_test_between_clfs(y_test, pred_proba_1, pred_proba_2, nsamples=1000):
    auc_differences = []
    auc1 = roc_auc_score(y_test.ravel(), pred_proba_1.ravel())
    auc2 = roc_auc_score(y_test.ravel(), pred_proba_2.ravel())
    observed_difference = auc1 - auc2
    for _ in range(nsamples):
        mask = np.random.randint(2, size=len(pred_proba_1.ravel()))
        p1 = np.where(mask, pred_proba_1.ravel(), pred_proba_2.ravel())
        p2 = np.where(mask, pred_proba_2.ravel(), pred_proba_1.ravel())
        auc1 = roc_auc_score(y_test.ravel(), p1)
        auc2 = roc_auc_score(y_test.ravel(), p2)
        auc_differences.append(auc1 - auc2)
    return observed_difference, np.mean(np.array(auc_differences) >= observed_difference)
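
For example, the pairwise comparisons reported below could be produced roughly like this (a sketch that reuses the prediction arrays from the question's code):

# Sketch: pairwise permutation tests between the three classifiers.
pairs = [('Naive Bayes vs Random Forest', nb_prediction_proba, rf_prediction_proba),
         ('Naive Bayes vs MLP', nb_prediction_proba, mlp_prediction_proba),
         ('Random Forest vs MLP', rf_prediction_proba, mlp_prediction_proba)]
for name, proba_a, proba_b in pairs:
    diff, p_value = permutation_test_between_clfs(y_test, proba_a, proba_b)
    print('{}: diff={:.4f}, p(diff>)={:.3f}'.format(name, diff, p_value))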

With this test and 1000 samples, I find no significant differences between the three classifiers:

  • Naive Bayes vs. Random Forest: diff = 0.0029, p(diff>) = 0.311
  • Naive Bayes vs. MLP: diff = 0.0117, p(diff>) = 0.186
  • Random Forest vs. MLP: diff = 0.0088, p(diff>) = 0.203

Here, diff denotes the difference in ROC AUC between the two classifiers, and p(diff>) is the empirical probability of observing a larger difference on a shuffled data set.

