Understanding ROC curve


Problem Description



import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, roc_auc_score
import numpy as np

correct_classification = np.array([0, 1])
predicted_classification = np.array([1, 1])

false_positive_rate, true_positive_rate, thresholds = roc_curve(correct_classification, predicted_classification)

print(false_positive_rate)
print(true_positive_rate)

From https://en.wikipedia.org/wiki/Sensitivity_and_specificity :

True positive: Sick people correctly identified as sick 
False positive: Healthy people incorrectly identified as sick 
True negative: Healthy people correctly identified as healthy 
False negative: Sick people incorrectly identified as healthy

I'm using these values: 0 = sick, 1 = healthy.

From https://en.wikipedia.org/wiki/False_positive_rate :

false positive rate = false positives / (false positives + true negatives)

number of false positives: 0, number of true negatives: 1

therefore false positive rate = 0 / (0 + 1) = 0

Reading the return value for roc_curve (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve) :

fpr : array, shape = [>2]

Increasing false positive rates such that element i is the false positive rate of predictions with score >= thresholds[i].

tpr : array, shape = [>2]

Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i].

thresholds : array, shape = [n_thresholds]

Decreasing thresholds on the decision function used to compute fpr and tpr. thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1.

How does this differ from my manual calculation of the false positive rate? How are the thresholds set? Some more information on thresholds is provided here: https://datascience.stackexchange.com/questions/806/advantages-of-auc-vs-standard-accuracy, but I'm confused as to how it fits with this implementation.

Solution

First, the Wikipedia article is considering sick = 1:

True positive: Sick people correctly identified as sick
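To see the consequence of that convention, here is a quick count over the question's two samples, done by hand in plain Python (no sklearn needed), taking 1 as the positive class — which is also the default `pos_label` in `roc_curve`:

```python
# The question's arrays: true labels and hard predictions.
correct = [0, 1]
predicted = [1, 1]

# With 1 as the positive class, the sample (true=0, pred=1) is a false
# positive, and there are no true negatives.
fp = sum(1 for y, p in zip(correct, predicted) if y == 0 and p == 1)
tn = sum(1 for y, p in zip(correct, predicted) if y == 0 and p == 0)

print(fp, tn)          # 1 0
print(fp / (fp + tn))  # FPR = 1.0, not 0
```

That is why the manual calculation in the question (which treats 0, "sick", as the positive class) disagrees with what `roc_curve` reports.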

Second, every model has a threshold based on the probabilities of the positive class (generally 0.5).

So if the threshold is 0.1, all samples with probabilities greater than 0.1 will be classified as positive. The probabilities of the predicted samples are fixed; it is the threshold that varies.
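A minimal sketch of that idea, with made-up probabilities (not from the question): the scores stay fixed, and only the threshold moves.

```python
probs = [0.05, 0.4, 0.7, 0.9]  # fixed model outputs (illustrative values)

def classify(probs, threshold):
    """Samples with probability strictly greater than the threshold are positive."""
    return [1 if p > threshold else 0 for p in probs]

print(classify(probs, 0.1))  # [0, 1, 1, 1]
print(classify(probs, 0.5))  # [0, 0, 1, 1]
```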

In roc_curve, scikit-learn increases the threshold value from:

0 (or the minimum value, where all the predictions are positive)

to

1 (or the last point, where all predictions become negative).

Intermediate points are decided based on where the predictions change from positive to negative.

Example:

Sample 1      0.2
Sample 2      0.3
Sample 3      0.6
Sample 4      0.7
Sample 5      0.8

The lowest probability here is 0.2, so the lowest threshold that makes any sense is 0.2. Now, as we keep increasing the threshold, since there are very few points in this example, the threshold changes at each probability (and is equal to that probability, because that is the point where the number of positives and negatives changes):

                     Negative    Positive
               <0.2     0          5
Threshold1     >0.2     1          4
Threshold2     >0.3     2          3
Threshold3     >0.6     3          2
Threshold4     >0.7     4          1
Threshold5     >0.8     5          0
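The counts in this table can be reproduced in a few lines of plain Python. Note the strict inequality: a sample whose score exactly equals the threshold is counted as negative here, which is what makes the counts come out as above (scikit-learn's own convention is score >= threshold → positive, so its cut points land slightly differently).

```python
probs = [0.2, 0.3, 0.6, 0.7, 0.8]  # the five sample probabilities above

for t in [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]:
    positives = sum(1 for p in probs if p > t)  # strictly above the threshold
    negatives = len(probs) - positives
    print(f"threshold {t}: {negatives} negative, {positives} positive")
```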

