How does sklearn calculate the area under the ROC curve for two binary inputs?

Question

I noticed that sklearn has the following function:

sklearn.metrics.roc_auc_score()

which takes as input ground_truth and prediction.

For example,

ground_truth = [1,1,0,0,0]
prediction = [1,1,0,0,0]

sklearn.metrics.roc_auc_score(ground_truth, prediction) returns 1

My problem is that I can't figure out how sklearn calculates the area under the ROC curve with two binary inputs. Isn't the ROC curve derived by moving the class assignment threshold, and calculating the false alarm and hit rate for each threshold? With two binary inputs, shouldn't you only have one (false alarm, hit rate) measurement?

Thanks very much!

Recommended answer

You're correct that with binary predictions you'll only have a single threshold/measurement for the curve. I didn't understand it myself, so I ran the code with a ton of print statements, first for the sklearn tutorial and then for a purely binary example. All the magic happens in the private helper _binary_clf_curve inside sklearn.metrics.

The "thresholds" are the distinct prediction scores. For any binary classifier that outputs only ones and zeros you get two thresholds, 1 and 0 (they're sorted internally from highest to lowest). At the 1 threshold, a prediction score >= 1 counts as positive and anything below it (only 0 in this case) counts as negative, and the TP and FP rates are calculated from that. In all cases, the last threshold classifies everything as positive, so the TP and FP rates are both 1. roc_curve also adds a starting point at (0, 0), so the "curve" is two line segments through the single measured point, and the area under it is computed with the trapezoidal rule.
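To make this concrete, here is a minimal sketch using the public roc_curve and roc_auc_score functions. The imperfect prediction list is a made-up example, not from the question, and the trapezoid sum is written out by hand only to illustrate the idea; it is not claimed to be sklearn's exact internal code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical example: hard 0/1 predictions with one miss and one false alarm.
ground_truth = [1, 1, 0, 0, 0]
prediction = [1, 0, 1, 0, 0]

fpr, tpr, thresholds = roc_curve(ground_truth, prediction)

# Three points come back: the added (0, 0) start, the single measured
# (false alarm rate, hit rate) pair, and the (1, 1) end point.
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr}: FPR={f:.3f}, TPR={t:.3f}")

# Trapezoidal rule over those points, written out by hand.
manual_auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
library_auc = roc_auc_score(ground_truth, prediction)
print(manual_auc, library_auc)
```

With a single measured point, the area reduces to (TPR + 1 - FPR) / 2, which here is (1/2 + 2/3) / 2 = 7/12.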

It appears, then, that to generate a proper ROC curve for an sklearn classifier you'd use the scores from clf.predict_proba() rather than the hard labels from predict(). Or maybe predict_log_proba()? I'm not sure whether it would make any difference.
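A rough way to check this is to score the same fitted classifier three ways. The sketch below assumes a synthetic dataset and a LogisticRegression model, both chosen purely for illustration. Since ROC AUC depends only on how the scores rank the samples, and the log is a monotonic transform, the probability and log-probability scores would be expected to give the same value, while the hard labels collapse the curve to a single measured point.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data and model are illustrative assumptions only.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

hard_labels = clf.predict(X_test)                    # only 0s and 1s
pos_proba = clf.predict_proba(X_test)[:, 1]          # positive-class probability
pos_log_proba = clf.predict_log_proba(X_test)[:, 1]  # log of the same column

auc_labels = roc_auc_score(y_test, hard_labels)
auc_proba = roc_auc_score(y_test, pos_proba)
auc_log_proba = roc_auc_score(y_test, pos_log_proba)

print("AUC from predict():          ", auc_labels)
print("AUC from predict_proba():    ", auc_proba)
print("AUC from predict_log_proba():", auc_log_proba)
```

Note that predict_proba() returns one column per class, so the positive-class column has to be selected before passing it to roc_auc_score in the binary case.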
