Different result with roc_auc_score() and auc()


Question

I have trouble understanding the difference (if there is one) between roc_auc_score() and auc() in scikit-learn.

I'm trying to predict a binary output with imbalanced classes (around 1.5% for Y=1).

model_logit = LogisticRegression(class_weight='auto')
model_logit.fit(X_train_ridge, Y_train)

ROC curve

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, model_logit.predict_proba(X_test)[:,1])

AUC

auc(false_positive_rate, true_positive_rate)
Out[490]: 0.82338034042531527

roc_auc_score(Y_test, model_logit.predict(X_test))
Out[493]: 0.75944737191205602

Can somebody explain this difference? I thought both were just calculating the area under the ROC curve. It might be because of the imbalanced dataset, but I could not figure out why.

Thanks!

Answer

AUC is not always the area under a ROC curve. "Area Under the Curve" is the (abstract) area under some curve, so it is a more general notion than AUROC. With imbalanced classes, it may be better to compute the AUC of a precision-recall curve instead.
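For illustration, here is a minimal sketch of that suggestion. The synthetic X and y are placeholders standing in for the question's data (with roughly its 1.5% positive rate), and class_weight='balanced' is the modern spelling of the question's 'auto':

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, auc

# placeholder data with roughly the question's class imbalance
rng = np.random.RandomState(0)
X = rng.rand(2000, 5)
y = (rng.rand(2000) < 0.015).astype(int)

clf = LogisticRegression(class_weight='balanced').fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# auc() integrates the precision-recall curve just as well as a ROC curve
precision, recall, _ = precision_recall_curve(y, scores)
print(auc(recall, precision))

sklearn's average_precision_score(y, scores) is a closely related summary of the same curve.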

See the sklearn source for roc_auc_score:

def roc_auc_score(y_true, y_score, average="macro", sample_weight=None):
    # <...> docstring <...>
    def _binary_roc_auc_score(y_true, y_score, sample_weight=None):
        # <...> bla-bla <...>

        fpr, tpr, tresholds = roc_curve(y_true, y_score,
                                        sample_weight=sample_weight)
        return auc(fpr, tpr, reorder=True)

    return _average_binary_score(
        _binary_roc_auc_score, y_true, y_score, average,
        sample_weight=sample_weight)

As you can see, this first gets a ROC curve and then calls auc() to get the area.
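So given one and the same score vector, the two calls compute the same number. A quick check with toy scores (an example added here for illustration):

import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, _ = roc_curve(y_true, y_score)
print(auc(fpr, tpr))                   # 0.75
print(roc_auc_score(y_true, y_score))  # 0.75, same area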

I guess your problem is the predict_proba() call. For a plain predict() the outputs are always the same:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, roc_auc_score

est = LogisticRegression(class_weight='balanced')  # 'auto' in older scikit-learn versions
X = np.random.rand(10, 2)
y = np.random.randint(2, size=10)
est.fit(X, y)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y, est.predict(X))
print(auc(false_positive_rate, true_positive_rate))
# 0.857142857143
print(roc_auc_score(y, est.predict(X)))
# 0.857142857143

If you change the above to this, you'll sometimes get different outputs:

false_positive_rate, true_positive_rate, thresholds = roc_curve(y, est.predict_proba(X)[:,1])
print(auc(false_positive_rate, true_positive_rate))  # may differ
print(roc_auc_score(y, est.predict(X)))
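The mismatch comes from scoring hard 0/1 labels from predict() in one call and continuous scores from predict_proba() in the other, which yield different ROC curves. Continuing the snippet above, passing the same probabilities to roc_auc_score makes the two numbers agree again:

# same scores into both functions -> same area
print(roc_auc_score(y, est.predict_proba(X)[:, 1]))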
