sklearn's precision_recall_curve incorrect on small example
Question
Here is a very small example using precision_recall_curve():
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
y_true = [0, 1]
y_predict_proba = [0.25, 0.75]
precision, recall, thresholds = precision_recall_curve(y_true, y_predict_proba)
precision, recall
This results in:
(array([1., 1.]), array([1., 0.]))
The above does not match the "manual" calculation that follows.
There are three possible class vectors depending on the threshold: [0,0] (when the threshold is > 0.75), [0,1] (when the threshold is between 0.25 and 0.75), and [1,1] (when the threshold is < 0.25). We must discard [0,0] because it gives an undefined precision (division by zero). So, applying precision_score() and recall_score() to the other two:
y_predict_class=[0,1]
precision_score(y_true, y_predict_class), recall_score(y_true, y_predict_class)
gives:
(1.0, 1.0)
and
y_predict_class=[1,1]
precision_score(y_true, y_predict_class), recall_score(y_true, y_predict_class)
gives:
(0.5, 1.0)
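The two manual calculations above can be reproduced in a single loop. This is a sketch; the thresholds 0.5 and 0.2 are arbitrary representatives of the two achievable class vectors, not values taken from the question:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 1]
y_predict_proba = [0.25, 0.75]

# One representative threshold per non-degenerate class vector:
# 0.5 yields [0, 1]; 0.2 yields [1, 1]. ([0, 0] is skipped: precision undefined.)
for threshold in [0.5, 0.2]:
    y_predict_class = [int(p >= threshold) for p in y_predict_proba]
    print(threshold, y_predict_class,
          precision_score(y_true, y_predict_class),
          recall_score(y_true, y_predict_class))
# prints: 0.5 [0, 1] 1.0 1.0
#         0.2 [1, 1] 0.5 1.0
```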
This does not seem to match the output of precision_recall_curve() (which, for example, did not produce a 0.5 precision value).
Am I missing something?
Answer
I know I am late, but I had the same doubt, which I eventually resolved. The main point here is that precision_recall_curve() stops outputting precision and recall values once full recall is reached for the first time; moreover, it concatenates a 0 to the recall array and a 1 to the precision array so that the curve starts at the y-axis.
In your specific example, you effectively have two arrays built like this (they are ordered the other way around because of scikit-learn's particular implementation):
precision, recall
(array([1., 0.5]), array([1., 1.]))
Then, the values of the two arrays that correspond to the second occurrence of full recall are omitted, and 1 and 0 (for precision and recall, respectively) are concatenated as described above:
precision, recall
(array([1., 1.]), array([1., 0.]))
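This behavior can be checked directly by also inspecting the returned thresholds array; the expected values below assume scikit-learn's current implementation, where a score >= threshold is predicted positive:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = [0, 1]
y_predict_proba = [0.25, 0.75]

precision, recall, thresholds = precision_recall_curve(y_true, y_predict_proba)

# Full recall is already reached at threshold 0.75, so the point for
# threshold 0.25 (precision 0.5) is dropped, and the (precision=1, recall=0)
# endpoint is appended to anchor the curve at the y-axis.
print(precision)   # [1. 1.]
print(recall)      # [1. 0.]
print(thresholds)  # [0.75]
```

Note that thresholds has one fewer entry than precision and recall: the appended (1, 0) endpoint has no corresponding threshold.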
I have tried to explain it here in full detail; another useful link is certainly this one.