sklearn precision_recall_curve and threshold
Question
I was wondering how sklearn decides how many thresholds to use in precision_recall_curve. There is another post on this: "How does sklearn select threshold steps in precision recall curve?". It points to the source code, where I found this example:
import numpy as np
from sklearn.metrics import precision_recall_curve
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
which then gives:
>>> precision
array([0.66666667, 0.5 , 1. , 1. ])
>>> recall
array([1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.35, 0.4 , 0.8 ])
Could someone explain how these recall and precision values are obtained, by showing me what is computed?
Recommended answer
I know I am a bit late here, but I had a similar doubt, which the link you provided cleared up. Roughly speaking, here is what happens inside precision_recall_curve(), following the sklearn implementation.
The decision scores are sorted in descending order, and the labels are reordered to match:
desc_score_indices = np.argsort(y_scores, kind="mergesort")[::-1]
y_scores = y_scores[desc_score_indices]
y_true = y_true[desc_score_indices]
You'll get:
y_scores, y_true
(array([0.8 , 0.4 , 0.35, 0.1 ]), array([1, 0, 1, 0]))
The sklearn implementation then drops duplicated values of y_scores, keeping one index per distinct score (there are no duplicates in this example).
distinct_value_indices = np.where(np.diff(y_scores))[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
Since there are no duplicates, you'll get:
distinct_value_indices, threshold_idxs
(array([0, 1, 2], dtype=int64), array([0, 1, 2, 3], dtype=int64))
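To see what this step does when ties are present, here is a small sketch with a made-up score array (the values and the `_sorted` names are assumptions for illustration, not taken from the question). It only reuses the two NumPy lines above and shows that each run of equal scores contributes a single index, the last one of the run.

```python
import numpy as np

# Hypothetical data with a tie at 0.4, already sorted in descending order.
y_scores_sorted = np.array([0.8, 0.4, 0.4, 0.1])
y_true_sorted = np.array([1, 0, 1, 0])

# np.diff is non-zero exactly where the score changes, so np.where returns
# the last index of every run of equal scores except the final run.
distinct_value_indices = np.where(np.diff(y_scores_sorted))[0]

# Append the last index so the final run is represented as well.
threshold_idxs = np.r_[distinct_value_indices, y_true_sorted.size - 1]

print(threshold_idxs)                   # [0 2 3]
print(y_scores_sorted[threshold_idxs])  # [0.8 0.4 0.1]
```

Because the cumulative sums in the next step are read at these indices, both samples scored 0.4 are counted together for the threshold 0.4.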
Finally, you can compute the number of true positives and false positives, from which you can in turn compute precision and recall.
# tps at index i being the number of positive samples assigned a score >= thresholds[i]
tps = np.cumsum(y_true)[threshold_idxs]
# fps at index i being the number of negative samples assigned a score >= thresholds[i];
# sklearn computes it as fps = 1 + threshold_idxs - tps
fps = np.cumsum(1 - y_true)[threshold_idxs]
y_scores = y_scores[threshold_idxs]
After this step you'll have two arrays holding the number of true positives and false positives at each considered score.
tps, fps
(array([1, 1, 2, 2], dtype=int32), array([0, 1, 1, 2], dtype=int32))
You can now compute precision and recall.
precision = tps / (tps + fps)
# tps[-1] being the total number of positive samples
recall = tps / tps[-1]
precision, recall
(array([1. , 0.5 , 0.66666667, 0.5 ]), array([0.5, 0.5, 1. , 1. ]))
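As a sanity check, the same numbers can be obtained without any cumulative sums by counting true and false positives directly at each score. This is only a brute-force sketch for verification, not how sklearn computes the curve; the names `candidate_thresholds`, `precision_check` and `recall_check` are introduced here for illustration.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Every distinct score is a candidate threshold; use descending order to
# match the sorted arrays in the walkthrough.
candidate_thresholds = np.unique(y_scores)[::-1]
n_positives = np.sum(y_true == 1)

precision_check, recall_check = [], []
for t in candidate_thresholds:
    predicted_positive = y_scores >= t
    tp = np.sum(predicted_positive & (y_true == 1))  # true positives at t
    fp = np.sum(predicted_positive & (y_true == 0))  # false positives at t
    precision_check.append(tp / (tp + fp))
    recall_check.append(tp / n_positives)

print(candidate_thresholds)  # [0.8  0.4  0.35 0.1 ]
print(precision_check)       # 1.0, 0.5, 0.666..., 0.5
print(recall_check)          # 0.5, 0.5, 1.0, 1.0
```

For example, at threshold 0.35 three samples are predicted positive, two of them truly positive, hence precision 2/3 and recall 2/2 = 1.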
An important point, the one raised in the link you referenced, explains why the thresholds array is shorter than the y_scores array even though y_scores has no duplicates. Basically, the index of the first occurrence of recall equal to 1 determines the length of the thresholds array (index 2 here, which is why thresholds has length 3).
last_ind = tps.searchsorted(tps[-1]) # 2
sl = slice(last_ind, None, -1) # from index 2 to 0
precision, recall, thresholds = np.r_[precision[sl], 1], np.r_[recall[sl], 0], y_scores[sl]
(array([0.66666667, 0.5 , 1. , 1. ]),
array([1. , 0.5, 0.5, 0. ]), array([0.35, 0.4 , 0.8 ]))
One last point: precision and recall have length 4 because a final precision of 1 and a final recall of 0 are appended to the arrays, so that the precision-recall curve starts on the y-axis.
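Putting the steps together, the whole walkthrough can be condensed into one function. This is a minimal sketch of the procedure described in this answer, not the library's actual code: the name `pr_curve_sketch` is hypothetical, it assumes binary 0/1 labels with at least one positive sample, and the results of precision_recall_curve itself may differ between sklearn versions.

```python
import numpy as np

def pr_curve_sketch(y_true, y_scores):
    """Reproduce the steps of this answer (hypothetical helper, not sklearn API)."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)

    # 1. Sort scores in descending order and reorder the labels to match.
    order = np.argsort(y_scores, kind="mergesort")[::-1]
    y_scores, y_true = y_scores[order], y_true[order]

    # 2. Keep one index per distinct score (the last index of each run).
    distinct = np.where(np.diff(y_scores))[0]
    idxs = np.r_[distinct, y_true.size - 1]

    # 3. Cumulative true/false positive counts at each kept score.
    tps = np.cumsum(y_true)[idxs]
    fps = 1 + idxs - tps

    # 4. Precision and recall; tps[-1] is the total number of positives.
    precision = tps / (tps + fps)
    recall = tps / tps[-1]

    # 5. Stop at the first index with full recall, reverse the order,
    #    and append the final point (precision=1, recall=0).
    last_ind = tps.searchsorted(tps[-1])
    sl = slice(last_ind, None, -1)
    return np.r_[precision[sl], 1], np.r_[recall[sl], 0], y_scores[idxs][sl]

precision, recall, thresholds = pr_curve_sketch([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(precision)   # 0.666..., 0.5, 1.0, 1.0
print(recall)      # 1.0, 0.5, 0.5, 0.0
print(thresholds)  # 0.35, 0.4, 0.8
```

On the data from the question this returns the same three arrays quoted there, with thresholds one element shorter than precision and recall.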