Reason of having high AUC and low accuracy in a balanced dataset


Question

Given a balanced dataset (both classes are the same size), fitting an SVM model to it I get a high AUC value (~0.9) but a low accuracy (~0.5).

I have no idea why this would happen. Can anyone explain this case for me?

Answer

I recently stumbled upon the same question. Here is what I figured out for myself - let me know if I'm wrong.

Before we ponder why the area under the ROC curve (AUC) can be high while accuracy is low, let's first recapitulate the meanings of these terms.

The receiver operating characteristic (ROC) curve plots the true positive rate TPR(t) against the false positive rate FPR(t), for varying decision thresholds (or prediction cutoffs) t.

TPR and FPR are defined as follows:

TPR = TP / P = TP / (TP+FN) = number of true positives / number of positives
FPR = FP / N = FP / (FP+TN) = number of false positives / number of negatives
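
As a quick numeric illustration (a sketch with made-up counts, not values from the question), the two rates follow directly from the entries of a confusion matrix:

TP, FN = 40, 10   # actual positives: correctly vs. incorrectly classified
FP, TN = 15, 35   # actual negatives: incorrectly vs. correctly classified

TPR = TP / (TP + FN)   # 40 / 50 = 0.8
FPR = FP / (FP + TN)   # 15 / 50 = 0.3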

In the ROC analysis, it is assumed that the classifier can be reduced to the following functional behavior:

def classifier(observation, t):
    # Score the observation and compare against the decision threshold t:
    # at or below t -> "negative" class A, above t -> "positive" class B.
    if score_function(observation) <= t:
        return "A"  # negative class
    else:
        return "B"  # positive class

Think of the decision threshold t as a free parameter that is adjusted when training a classifier. (Not all classifiers have such a straightforward parametrization, but for now stick with logistic regression or simple thresholding, for which there is an obvious choice for such a parameter t.) During the training process, the optimal threshold t* is chosen such that some cost function is minimized.
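
As an illustration (a minimal sketch, not part of the original answer), such a search for t* could look like the following, where `scores` and `labels` are assumed to be prediction scores and 0/1 ground-truth labels on held-out data, and the cost function is simply the misclassification rate:

import numpy as np

def choose_threshold(scores, labels):
    # Evaluate every candidate threshold and keep the one whose
    # misclassification rate (the chosen cost function) is smallest.
    candidates = np.unique(scores)
    costs = [np.mean((scores > t).astype(int) != labels) for t in candidates]
    return candidates[int(np.argmin(costs))]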

Given the training/test data, note that any choice of parameter t determines which of the data points are true positives (TP), false positives (FP), true negatives (TN) or false negatives (FN). Hence, any choice of t determines also the FPR(t) and TPR(t).

So we've seen the following: A ROC curve represents a curve parametrized by the decision threshold t, where x = FPR(t) and y = TPR(t) for all possible values of t.
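
In code, that parametrization is just a sweep over t. A sketch with NumPy (assuming `scores` and `labels` as above):

import numpy as np

def roc_points(scores, labels):
    # One (FPR(t), TPR(t)) pair per candidate threshold t.
    points = []
    for t in np.unique(scores):
        predicted_positive = scores > t
        tpr = np.mean(predicted_positive[labels == 1])  # TP / P
        fpr = np.mean(predicted_positive[labels == 0])  # FP / N
        points.append((fpr, tpr))
    return points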

The area under the resulting ROC curve is called AUC. It measures, for your training/test data, how well the classifier can discriminate between samples from the "positive" and the "negative" class. A perfect classifier's ROC curve would pass through the optimal point FPR(t*) = 0 and TPR(t*) = 1 and would yield an AUC of 1. A random classifier's ROC, however, follows the diagonal FPR(t) = TPR(t), yielding an AUC of 0.5.
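
If scikit-learn is available, the same quantity can be computed directly (a sketch; for an SVM, pass the continuous decision values rather than the hard 0/1 predictions as the scores):

from sklearn.metrics import roc_auc_score

# labels: true 0/1 classes, scores: e.g. model.decision_function(X_test)
auc = roc_auc_score(labels, scores)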

Finally, accuracy is defined as the ratio of all correctly labeled cases and the total number of cases:

accuracy = (TP+TN)/(Total number of cases) = (TP+TN)/(TP+FP+TN+FN)
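
With the same hypothetical counts as in the earlier sketch:

accuracy = (TP + TN) / (TP + FP + TN + FN)   # (40 + 35) / 100 = 0.75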

So how can the AUC be large while the accuracy is low at the same time? Keep in mind that the AUC is computed over all possible thresholds, whereas accuracy is evaluated at the one threshold your classifier actually uses. Well, this may happen if your classifier achieves good ranking performance on the positive class (high AUC) at the cost of a high false negative rate (or a low number of true negatives) at that particular threshold.
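
A tiny synthetic example of exactly this situation (a sketch with made-up data, not the asker's dataset): the scores separate the two classes almost perfectly, so the AUC is close to 1, but a fixed cutoff at 0 lies above both score distributions, so every positive becomes a false negative and accuracy stays at 0.5 on the balanced set:

import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)                 # balanced dataset
scores = np.concatenate([rng.normal(-3.0, 0.3, 50),    # negative class scores
                         rng.normal(-1.5, 0.3, 50)])   # positive class scores

print(roc_auc_score(labels, scores))                       # ~1.0: classes are well separated
print(accuracy_score(labels, (scores > 0).astype(int)))    # 0.5: cutoff at 0 predicts everything negative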

The question why the training process led to a classifier with such a poor prediction performance is a different one and is specific to your problem/data and the classification methods you used.

In summary, the ROC analysis tells you something about how well the samples of the positive class can be separated from the other class, while the prediction accuracy hints at the actual performance of your classifier.
