How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

Problem description

I'm working on a sentiment analysis problem; the data looks like this:

label instances
    5    1190
    4     838
    3     239
    1     204
    2     127

So my data is unbalanced, since 1190 instances are labeled with 5. For the classification I'm using scikit's SVC. The problem is I do not know how to balance my data in the right way in order to compute the precision, recall, accuracy and f1-score accurately for the multiclass case. So I tried the following approaches:

First:

wclf = SVC(kernel='linear', C=1, class_weight={1: 10})
wclf.fit(X, y)
weighted_prediction = wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction, average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction, average='weighted')
print 'Precision:', precision_score(y_test, weighted_prediction, average='weighted')
print '\nclassification report:\n', classification_report(y_test, weighted_prediction)
print '\nconfusion matrix:\n', confusion_matrix(y_test, weighted_prediction)

Second:

auto_wclf = SVC(kernel='linear', C=1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)
print 'F1 score:', f1_score(y_test, auto_weighted_prediction, average='weighted')
print 'Recall:', recall_score(y_test, auto_weighted_prediction, average='weighted')
print 'Precision:', precision_score(y_test, auto_weighted_prediction, average='weighted')
print '\nclassification report:\n', classification_report(y_test, auto_weighted_prediction)
print '\nconfusion matrix:\n', confusion_matrix(y_test, auto_weighted_prediction)

Third:

clf = SVC(kernel='linear', C=1)
clf.fit(X, y)
prediction = clf.predict(X_test)

from sklearn.metrics import precision_score, \
    recall_score, confusion_matrix, classification_report, \
    accuracy_score, f1_score

print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\nclassification report:\n', classification_report(y_test, prediction)
print '\nconfusion matrix:\n', confusion_matrix(y_test, prediction)


F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
 0.930416613529

However, I'm getting warnings like this:

/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172:
DeprecationWarning: The default `weighted` averaging is deprecated,
and from version 0.18, use of precision, recall or F-score with 
multiclass or multilabel data or pos_label=None will result in an 
exception. Please set an explicit value for `average`, one of (None, 
'micro', 'macro', 'weighted', 'samples'). In cross validation use, for 
instance, scoring="f1_weighted" instead of scoring="f1"

How can I deal correctly with my unbalanced data in order to compute the classifier's metrics in the right way?

Recommended answer

I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you, so I am going to cover different topics; bear with me ;).

The weights from the class_weight parameter are used to train the classifier. They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will be different simply because the classifier is different.

Basically, in every scikit-learn classifier the class weights are used to tell your model how important a class is. That means that during training, the classifier will make extra efforts to classify the classes with high weights properly. How they do that is algorithm-specific. If you want details about how it works for SVC and the doc does not make sense to you, feel free to mention it.
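To make that concrete, here is a minimal sketch on artificial data generated with make_classification (not the sentiment dataset from the question), showing how class_weight changes what SVC learns: up-weighting the rare class typically shifts the decision boundary so that more points are predicted as that class.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Artificial, imbalanced two-class data, purely for illustration.
X, y = make_classification(n_samples=500, n_classes=2, weights=[0.9, 0.1],
                           random_state=0)

# Default: mistakes on both classes cost the same during training.
plain = SVC(kernel='linear', C=1).fit(X, y)

# Mistakes on class 1 cost ten times more, so typically more points
# end up predicted as the rare class.
weighted = SVC(kernel='linear', C=1, class_weight={1: 10}).fit(X, y)

print("predicted as class 1 -- default: %d, weighted: %d"
      % ((plain.predict(X) == 1).sum(), (weighted.predict(X) == 1).sum()))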

Once you have a classifier, you want to know how well it is performing. Here you can use the metrics you mentioned: accuracy, recall_score, f1_score...

Usually, when the class distribution is unbalanced, accuracy is considered a poor choice, as it gives high scores to models that just predict the most frequent class.
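As a quick, hypothetical sanity check built from the class counts in the question: a degenerate "model" that always answers the majority label 5 already reaches about 46% accuracy (1190 out of 2598 instances) while learning nothing, far above the 20% of uniform guessing over five classes.

import numpy as np
from sklearn.metrics import accuracy_score

# Rebuild a label vector from the counts listed in the question.
counts = {5: 1190, 4: 838, 3: 239, 1: 204, 2: 127}
y_true = np.concatenate([np.full(n, label, dtype=int)
                         for label, n in counts.items()])

# A "classifier" that always predicts the most frequent class.
y_pred = np.full_like(y_true, 5)

print(accuracy_score(y_true, y_pred))  # ~0.458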

I will not detail all these metrics, but note that, with the exception of accuracy, they are naturally applied at the class level: as you can see in this print of a classification report, they are defined for each class. They rely on concepts such as true positives or false negatives, which require defining which class is the positive one.

             precision    recall  f1-score   support

          0       0.65      1.00      0.79        17
          1       0.57      0.75      0.65        16
          2       0.33      0.06      0.10        17
avg / total       0.52      0.60      0.51        50
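If you want those per-class numbers as arrays rather than printed text, precision_recall_fscore_support with average=None returns one value per class; a small sketch on made-up labels:

from sklearn.metrics import precision_recall_fscore_support

# Made-up labels for three classes, only to show the shape of the output.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# With average=None you get one precision/recall/f1/support entry per class,
# exactly the rows of the classification report above.
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred,
                                                                 average=None)
print(precision)  # [ 0.5  0.667  1. ] -- class 0, class 1, class 2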

The warning

F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The 
default `weighted` averaging is deprecated, and from version 0.18, 
use of precision, recall or F-score with multiclass or multilabel data  
or pos_label=None will result in an exception. Please set an explicit 
value for `average`, one of (None, 'micro', 'macro', 'weighted', 
'samples'). In cross validation use, for instance, 
scoring="f1_weighted" instead of scoring="f1".

You get this warning because you are using the f1-score, recall and precision without defining how they should be computed! The question could be rephrased: from the above classification report, how do you output one global number for the f1-score? You could:

  1. Take the average of the f1-score for each class: that's the avg / total result above. It's also called macro averaging.
  2. Compute the f1-score using the global count of true positives / false negatives, etc. (you sum the number of true positives / false negatives for each class). Aka micro averaging.
  3. Compute a weighted average of the f1-score. Using 'weighted' in scikit-learn will weigh the f1-score by the support of the class: the more elements a class has, the more important the f1-score for this class is in the computation.

These are 3 of the options available in scikit-learn, and the warning is there to say you have to pick one. So you have to specify an average argument for the score method.
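To see the three options side by side, here is a short sketch on made-up labels (any toy multiclass vectors will do); the three numbers differ, and 'weighted' leans toward the larger classes:

from sklearn.metrics import f1_score

# Made-up multiclass labels, only to compare the averaging modes.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 0, 2]

for average in ('macro', 'micro', 'weighted'):
    print("%-8s %.3f" % (average, f1_score(y_true, y_pred, average=average)))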

Which one you choose is up to how you want to measure the performance of the classifier: for instance, macro-averaging does not take class imbalance into account, so the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging, however, class 5 will get more importance.

The whole argument specification for these metrics is not super-clear in scikit-learn right now; according to the docs, it will get better in version 0.18. They are removing some non-obvious standard behavior, and they are issuing warnings so that developers notice it.

The last thing I want to mention (feel free to skip it if you're aware of it) is that scores are only meaningful if they are computed on data that the classifier has never seen. This is extremely important, as any score you get on data that was used for fitting the classifier is completely irrelevant.

Here's a way to do it using StratifiedShuffleSplit, which gives you random splits of your data (after shuffling) that preserve the label distribution.

from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.svm import SVC

# We use a utility to generate artificial classification data.
X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)
# Stratified split: each fold keeps the label distribution of y.
sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
# Define the classifier (any estimator works; linear SVC matches the question).
svc = SVC(kernel='linear', C=1)
for train_idx, test_idx in sss:
    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    print(f1_score(y_test, y_pred, average="macro"))
    print(precision_score(y_test, y_pred, average="macro"))
    print(recall_score(y_test, y_pred, average="macro"))

Hope this helps.
