sklearn precision_recall_curve and threshold


Question

I was wondering how sklearn decides how many thresholds to use in precision_recall_curve. There is another post on this here: How does sklearn select threshold steps in precision recall curve?. It mentions the source code, which is where I found this example:

import numpy as np
from sklearn.metrics import precision_recall_curve
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

This gives:

>>> precision
    array([0.66666667, 0.5       , 1.        , 1.        ])
>>> recall
    array([1. , 0.5, 0.5, 0. ])
>>> thresholds
    array([0.35, 0.4 , 0.8 ])

Could someone explain to me how to get those recalls and precisions by showing me what is computed?

Answer

I know I am a bit late here, but I had a similar doubt, which the link you provided has cleared up. Roughly speaking, here is what happens inside precision_recall_curve() in the sklearn implementation.

  1. Decision scores are sorted in descending order, and the labels are reordered accordingly:

desc_score_indices = np.argsort(y_scores, kind="mergesort")[::-1]
y_scores = y_scores[desc_score_indices]
y_true = y_true[desc_score_indices]

You'll get:

y_scores, y_true
(array([0.8 , 0.4 , 0.35, 0.1 ]), array([1, 0, 1, 0]))

  2. The sklearn implementation then excludes duplicated values of y_scores (there are no duplicates in this example; a minimal sketch of the duplicate case follows the output below).

    distinct_value_indices = np.where(np.diff(y_scores))[0]
    threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
    

    Due to the absence of duplicates, you'll get:

    distinct_value_indices, threshold_idxs 
    (array([0, 1, 2], dtype=int64), array([0, 1, 2, 3], dtype=int64))
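
    As a minimal sketch of the duplicate case (using hypothetical scores, not the ones from the question), suppose the sorted scores were [0.8, 0.4, 0.4, 0.1]; only the last occurrence of each distinct score survives as a candidate threshold:

    import numpy as np
    y_scores_dup = np.array([0.8, 0.4, 0.4, 0.1])  # hypothetical scores, already sorted descending
    # np.diff is zero where consecutive scores are equal, so the duplicated 0.4 at index 1 is dropped
    distinct_value_indices = np.where(np.diff(y_scores_dup))[0]            # array([0, 2])
    threshold_idxs = np.r_[distinct_value_indices, y_scores_dup.size - 1]  # array([0, 2, 3])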
    

  3. Finally, you can compute the number of true positives and false positives, from which you can in turn compute precision and recall.

    # tps at index i being the number of positive samples assigned a score >= thresholds[i]
    tps = np.cumsum(y_true)[threshold_idxs]
    # fps at index i being the number of negative samples assigned a score >= thresholds[i], sklearn computes it as fps = 1 + threshold_idxs - tps
    fps = np.cumsum(1 - y_true)[threshold_idxs]
    y_scores = y_scores[threshold_idxs]
    

    After these steps, you'll have two arrays with the number of true positives and false positives per considered score.

    tps, fps
    (array([1, 1, 2, 2], dtype=int32), array([0, 1, 1, 2], dtype=int32))
    

  4. Eventually, you can compute precision and recall; a quick element-by-element check follows the output below.

    precision = tps / (tps + fps)
    # tps[-1] being the total number of positive samples
    recall = tps / tps[-1]
    
    precision, recall
    (array([1.        , 0.5       , 0.66666667, 0.5       ]), array([0.5, 0.5, 1. , 1. ]))
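
    Spelling out the arithmetic element by element, as a quick manual check of the arrays above (plain NumPy, reusing the tps and fps values from the previous step):

    import numpy as np
    tps = np.array([1, 1, 2, 2])
    fps = np.array([0, 1, 1, 2])
    # precision = tps / (tps + fps): [1/1, 1/2, 2/3, 2/4]
    assert np.allclose(tps / (tps + fps), [1.0, 0.5, 2 / 3, 0.5])
    # recall = tps / tps[-1]: [1/2, 1/2, 2/2, 2/2]
    assert np.allclose(tps / tps[-1], [0.5, 0.5, 1.0, 1.0])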
    

    An important point, which causes the thresholds array to be shorter than the y_scores array (even though there are no duplicates in y_scores), is the one pointed out in the link you referenced. Basically, the index of the first occurrence of a recall equal to 1 defines the length of the thresholds array (index 2 here, corresponding to a length of 3, which is why thresholds has 3 elements).

    last_ind = tps.searchsorted(tps[-1])   # 2
    sl = slice(last_ind, None, -1)         # from index 2 to 0
    
    precision, recall, thresholds = np.r_[precision[sl], 1], np.r_[recall[sl], 0], y_scores[sl]
    
    (array([0.66666667, 0.5       , 1.        , 1.        ]),
    array([1. , 0.5, 0.5, 0. ]), array([0.35, 0.4 , 0.8 ]))
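
    To illustrate what searchsorted does here: tps is non-decreasing, and searchsorted returns the left-most index at which tps[-1] (the total number of positives, 2) could be inserted while keeping the order, i.e. the first index where recall reaches 1:

    import numpy as np
    tps = np.array([1, 1, 2, 2])
    last_ind = tps.searchsorted(tps[-1])  # 2: first index where tps equals the total number of positives
    assert last_ind == 2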
    

    One last point: the length of precision and recall is 4 because a precision value of 1 and a recall value of 0 are appended to the obtained arrays so that the precision-recall curve starts at the y-axis, as illustrated by the sketch below.
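
    As a minimal plotting sketch (assuming matplotlib is available), plotting the returned arrays shows the appended point (recall=0, precision=1) as the left-most marker, which is what makes the curve reach the y-axis:

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.metrics import precision_recall_curve

    y_true = np.array([0, 0, 1, 1])
    y_scores = np.array([0.1, 0.4, 0.35, 0.8])
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

    # recall is on the x-axis, so the appended point (recall=0, precision=1) sits on the y-axis
    plt.plot(recall, precision, marker="o")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.show()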
