SVM classification with always high precision


Problem description

I have a binary classification problem and I'm trying to get a precision-recall curve for my classifier. I use libsvm with an RBF kernel and the probability-estimate option.

To get the curve, I vary the decision threshold from 0 to 1 in steps of 0.1. But on every run I get high precision, even though recall decreases as the threshold increases. My number of false positives always seems low compared to the number of true positives.
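The threshold sweep described above can be sketched as follows. This is a minimal illustration with made-up scores, not the asker's actual data; `y_true` and `y_score` stand in for the real labels and libsvm's probability estimates:

```python
import numpy as np

def precision_recall_at_thresholds(y_true, y_score, thresholds):
    """Compute TP/FP/FN, precision, and recall at each decision threshold.

    y_true: array of 0/1 labels; y_score: predicted probability of class 1.
    """
    results = []
    for t in thresholds:
        y_pred = (y_score >= t).astype(int)
        tp = int(np.sum((y_pred == 1) & (y_true == 1)))
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results.append((t, tp, fp, fn, precision, recall))
    return results

# Toy data, purely for illustration
y_true = np.array([1, 1, 1, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.6, 0.2, 0.95])
for t, tp, fp, fn, p, r in precision_recall_at_thresholds(y_true, y_score, [0.1, 0.5, 0.9]):
    print(f"Threshold: {t:.1f}  TP:{tp}, FP:{fp}, FN:{fn}  Precision:{p:f}, Recall:{r:f}")
```

Note that a fixed 0.1 grid is coarse; sweeping over the sorted distinct scores gives the full curve.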

My results are:

Threshold: 0.1
TOTAL TP:393, FP:1, FN: 49
Precision:0.997462, Recall: 0.889140

Threshold: 0.2
TOTAL TP:393, FP:5, FN: 70
Precision:0.987437, Recall: 0.848812

Threshold: 0.3
TOTAL TP:354, FP:4, FN: 78
Precision:0.988827, Recall: 0.819444

Threshold: 0.4
TOTAL TP:377, FP:9, FN: 104
Precision:0.976684, Recall: 0.783784

Threshold: 0.5
TOTAL TP:377, FP:5, FN: 120
Precision:0.986911, Recall: 0.758551

Threshold: 0.6
TOTAL TP:340, FP:4, FN: 144
Precision:0.988372, Recall: 0.702479

Threshold: 0.7
TOTAL TP:316, FP:5, FN: 166
Precision:0.984424, Recall: 0.655602

Threshold: 0.8
TOTAL TP:253, FP:2, FN: 227
Precision:0.992157, Recall: 0.527083

Threshold: 0.9
TOTAL TP:167, FP:2, FN: 354
Precision:0.988166, Recall: 0.320537

Does this mean I have a good classifier, or is there a fundamental mistake somewhere?

Answer

One reason for this could be that your training data contains many more negative samples than positive ones. As a result, almost all examples get classified as negative except for a few, so you see high precision (few false positives) together with low recall (many false negatives).

Given that you have more negative samples than positive ones:

If you look at the results, the number of false negatives grows as you increase the threshold, i.e. more of your positive samples get classified as negative, which is not a good thing. Again, it depends on your problem: some problems favor high precision over recall, others high recall over precision. If you want both precision and recall to be high, you may need to resolve the class imbalance, either by oversampling (repeating positive samples so the ratio becomes 1:1), by undersampling (taking a random subset of negative samples in proportion to the positives), or by something more sophisticated like the SMOTE algorithm (which synthesizes new samples similar to the existing positives).
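The oversampling option can be sketched like this. `oversample_positives` is a hypothetical helper written for illustration, not part of libsvm:

```python
import numpy as np

def oversample_positives(X, y, seed=0):
    """Randomly repeat positive samples (label 1) until the classes are balanced.

    Assumes the negative class (label 0) is the majority class.
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    # Draw extra positive indices with replacement to reach a 1:1 ratio
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

For the SMOTE variant, the third-party imbalanced-learn package provides a ready-made implementation.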

Also, your classifier almost certainly has a "class_weight" parameter (or equivalent), which gives more importance to errors on the class with fewer training examples. You might want to try giving the positive class more weight than the negative one; in libsvm's svm-train this is the -wi option, which multiplies C for class i.
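A minimal sketch of the class-weight idea using scikit-learn's SVC, which wraps libsvm; the synthetic data here is made up purely for illustration, and "balanced" scales each class's weight inversely to its frequency:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Imbalanced toy labels: only points far in one corner are positive
y = (X[:, 0] + X[:, 1] > 1.5).astype(int)

# class_weight="balanced" penalizes errors on the rare class more heavily;
# an explicit dict like {1: 10} works too
clf = SVC(kernel="rbf", probability=True, class_weight="balanced")
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # probability of the positive class
```

With the command-line tools, the rough equivalent would be something like `svm-train -b 1 -w1 10 data.txt`.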
