Interpreting AUC, accuracy and f1-score on the unbalanced dataset

Question

I am trying to understand how AUC is a better metric than classification accuracy when the dataset is unbalanced.
Suppose a dataset contains 1000 examples of 3 classes, as follows:

a = [[1.0, 0, 0]]*950 + [[0, 1.0, 0]]*30 + [[0, 0, 1.0]]*20

Clearly, this data is unbalanced.
A naive strategy is to predict every point as belonging to the first class.
Suppose we have a classifier with the following predictions:

b = [[0.7, 0.1, 0.2]]*1000

With the true labels in the list a and the predictions in the list b, the classification accuracy is 0.95.
So one might believe that the model is doing really well on the classification task, but it is not, because the model predicts every point as belonging to the same class.
Therefore, the AUC metric is suggested for evaluating an unbalanced dataset.
If we compute AUC using the TF Keras AUC metric, we obtain ~0.96.
If we compute the f1-score using the sklearn f1_score metric after setting b = [[1, 0, 0]]*1000, we obtain 0.95.
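
For reference, here is a minimal sketch of how these numbers can be reproduced (assuming tf.keras.metrics.AUC with its default settings and sklearn's f1_score with average='micro'; the exact AUC value can vary slightly with the number of thresholds):

import numpy as np
import tensorflow as tf
from sklearn.metrics import accuracy_score, f1_score

# One-hot labels and the constant soft predictions from the question.
a = np.array([[1.0, 0, 0]]*950 + [[0, 1.0, 0]]*30 + [[0, 0, 1.0]]*20)
b = np.array([[0.7, 0.1, 0.2]]*1000)

# Accuracy on the argmax of the predictions -> 0.95
print(accuracy_score(a.argmax(axis=1), b.argmax(axis=1)))

# TF Keras AUC metric (by default it flattens the one-hot/probability arrays) -> ~0.96
auc = tf.keras.metrics.AUC()
auc.update_state(a, b)
print(auc.result().numpy())

# Micro-averaged F1 on the hard predictions [[1, 0, 0]]*1000 -> 0.95
print(f1_score(a, np.array([[1, 0, 0]]*1000), average='micro'))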

Now I am a little bit confused, because all the metrics (accuracy, AUC and f1-score) show high values, which would mean that the model is really good at the prediction task (which is not the case here).

What am I missing here, and how should we interpret these values?
Thanks.

Answer

You are very likely using the average='micro' parameter to calculate the F1-score. According to the docs, specifying 'micro' as the averaging strategy will:

Calculate metrics globally by counting the total true positives, false negatives and false positives.

In classification tasks where every test case is guaranteed to be assigned to exactly one class, computing a micro F1-score is equivalent to computing the accuracy score. Just check it out:

from sklearn.metrics import accuracy_score, f1_score

y_true = [[1, 0, 0]]*950 + [[0, 1, 0]]*30 + [[0, 0, 1]]*20
y_pred = [[1, 0, 0]]*1000

print(accuracy_score(y_true, y_pred)) # 0.95

print(f1_score(y_true, y_pred, average='micro')) # 0.9500000000000001
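
To see why the two numbers coincide, here is a small hand computation of the micro-averaged counts (a sketch reusing the same labels as above):

import numpy as np

y_true = np.array([0]*950 + [1]*30 + [2]*20)
y_pred = np.zeros(1000, dtype=int)  # every point predicted as class 0

# Total true positives, false positives and false negatives summed over all classes.
tp = sum(((y_true == c) & (y_pred == c)).sum() for c in range(3))  # 950
fp = sum(((y_true != c) & (y_pred == c)).sum() for c in range(3))  # 50
fn = sum(((y_true == c) & (y_pred != c)).sum() for c in range(3))  # 50

precision = tp / (tp + fp)  # 0.95
recall = tp / (tp + fn)     # 0.95
print(2 * precision * recall / (precision + recall))  # 0.95 == accuracy

In the single-label case every misclassified sample contributes exactly one false positive and one false negative, so micro precision, micro recall and micro F1 all collapse to the accuracy.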

You basically computed the same metric twice. By specifying average='macro' instead, the F1-score will be computed for each label independently first, and then averaged:

print(f1_score(y_true, y_pred, average='macro')) # 0.3247863247863248

As you can see, the overall F1-score depends on the averaging strategy, and a macro F1-score of less than 0.33 is a clear indicator of a model's deficiency in the prediction task.

Since the OP asked when to choose which strategy, and I think it might be useful for others as well, I will try to elaborate a bit on this issue.

scikit-learn actually implements four different strategies for metrics that support averaging over multiclass and multilabel classification tasks. Conveniently, classification_report returns all of the ones that apply to a given classification task for Precision, Recall and F1-score:

from sklearn.metrics import classification_report

# The same example but without nested lists. This prevents sklearn from interpreting it as a multilabel problem.
y_true = [0 for i in range(950)] + [1 for i in range(30)] + [2 for i in range(20)]
y_pred = [0 for i in range(1000)]

print(classification_report(y_true, y_pred, zero_division=0))

######################### output ####################

              precision    recall  f1-score   support

           0       0.95      1.00      0.97       950
           1       0.00      0.00      0.00        30
           2       0.00      0.00      0.00        20

    accuracy                           0.95      1000
   macro avg       0.32      0.33      0.32      1000
weighted avg       0.90      0.95      0.93      1000

All of them provide a different perspective, depending on how much emphasis one puts on the class distribution.

1. micro average is a global strategy that basically ignores that there is a distinction between classes. This might be useful or justified if someone is really just interested in overall disagreement in terms of true positives, false negatives and false positives, and is not concerned about differences within the classes. As hinted before, if the underlying problem is not a multilabel classification task, this actually equals the accuracy score. (This is also why the classification_report function returned accuracy instead of micro avg).

2. macro average as a strategy will calculate each metric for each label separately and return their unweighted mean. This is suitable if each class is of equal importance and the result should not be skewed in favor of any of the classes in the dataset.

3. weighted average will also first calculate each metric for each label separately. But the average is weighted according to the classes' support. This is desirable if the importance of each class is proportional to its support, i.e. a class that is underrepresented is considered less important (the sketch after this list illustrates the difference from the macro average).

4. samples average is only meaningful for multilabel classification and is therefore not returned by classification_report in this example, and also not discussed here ;)
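
To make the difference between macro and weighted concrete, here is a sketch that recomputes both averages from the per-class F1-scores (same labels as in the classification_report example above):

import numpy as np
from sklearn.metrics import f1_score

y_true = [0]*950 + [1]*30 + [2]*20
y_pred = [0]*1000

# Per-class F1-scores: roughly [0.97, 0.0, 0.0]
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)

# macro: unweighted mean of the per-class scores -> ~0.32
print(per_class.mean())

# weighted: mean weighted by each class's support (950, 30, 20) -> ~0.93
print(np.average(per_class, weights=[950, 30, 20]))

The heavily supported majority class dominates the weighted average, which is why it looks almost as optimistic as the accuracy, while the macro average exposes the two classes the model never predicts.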

So the choice of averaging strategy, and which resulting number to trust, really depends on the importance of the classes. Do I even care about class differences (if not --> micro average), and if so, are all classes equally important (if yes --> macro average), or are classes with higher support more important (--> weighted average)?
