Unbalanced labels - Better results in Confusion Matrix


Problem Description

I have unbalanced labels. That is, in a binary classifier, I have more positive (1) data and fewer negative (0) data. I'm using Stratified K Fold Cross Validation and getting true negatives as zero. Could you please let me know what options I have to get a value greater than zero for true negatives?

Recommended Answer

There are quite a lot of strategies for dealing with imbalanced classes.

First, let's understand what is (probably) happening. You are asking your classifier to maximize accuracy: that is, the fraction of records that were correctly classified. If, say, 85% of the records are in Class A, then you will get 85% accuracy by just labelling everything as Class A. And this seems to be the best the classifier can achieve.
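To make that concrete, here is a minimal sketch (with made-up labels, not the asker's data) showing how a classifier that always predicts the majority class reaches 85% accuracy while the true-negative cell of the confusion matrix stays at zero:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical labels: 85 positives (1) and 15 negatives (0)
y_true = np.array([1] * 85 + [0] * 15)
# A degenerate "classifier" that always predicts the majority class
y_pred = np.ones_like(y_true)

# Rows are true classes (0, 1), columns are predicted classes (0, 1)
print(confusion_matrix(y_true, y_pred))  # [[ 0 15]
                                         #  [ 0 85]]  -> true negatives = 0
print(accuracy_score(y_true, y_pred))    # 0.85 -- accuracy still looks good
```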

So, how can you correct for this?

1) You can try training your model on a balanced subset of your data. Randomly sample from the majority class only a number of records equal to those present in the minority class. This won't allow your classifier to get away with labelling everything as the majority class. But it will come at the cost of having less information available to discover the structure of the class boundary.
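A minimal sketch of that kind of random undersampling with scikit-learn (the make_classification data here is a synthetic stand-in for your own X and y):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Synthetic imbalanced data (roughly 85% positives) standing in for your own
X, y = make_classification(n_samples=1000, weights=[0.15, 0.85], random_state=0)

pos_idx = np.where(y == 1)[0]   # majority class
neg_idx = np.where(y == 0)[0]   # minority class

# Keep only as many majority records as there are minority records
pos_down = resample(pos_idx, replace=False, n_samples=len(neg_idx), random_state=0)
balanced_idx = np.concatenate([pos_down, neg_idx])
X_bal, y_bal = X[balanced_idx], y[balanced_idx]
# Fit the classifier on (X_bal, y_bal) instead of the full, skewed data
```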

2) Use a different optimization metric than accuracy. Popular choices are AUC or F1 score.
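For example, with scikit-learn you can score the same stratified folds the question mentions by AUC, or by F1 computed on the minority class; the logistic regression below is just a placeholder model on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: label 1 is the 85% majority, label 0 the minority
X, y = make_classification(n_samples=1000, weights=[0.15, 0.85], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# AUC measures ranking quality across both classes
print(cross_val_score(model, X, y, cv=cv, scoring='roc_auc').mean())

# F1 on the minority class (label 0 here) rather than the default positive label
f1_minority = make_scorer(f1_score, pos_label=0)
print(cross_val_score(model, X, y, cv=cv, scoring=f1_minority).mean())
```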

3) Create an ensemble of classifiers using method 1. Each classifier will see a subset of the data and 'vote' on a class, possibly with some confidence score. Each of these classifier outputs will be a feature for a final meta-classifier (possibly built using method 2). This way you can get access to all of the information available.
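A rough sketch of that idea on synthetic data (a production version would use out-of-fold predictions when fitting the meta-classifier, to avoid leakage):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic stand-in data: label 1 is the majority class
X, y = make_classification(n_samples=2000, weights=[0.15, 0.85], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]

# Train several base classifiers, each on its own balanced subsample (method 1)
base_models = []
for seed in range(5):
    pos_down = resample(pos_idx, replace=False,
                        n_samples=len(neg_idx), random_state=seed)
    idx = np.concatenate([pos_down, neg_idx])
    base_models.append(
        DecisionTreeClassifier(random_state=seed).fit(X_train[idx], y_train[idx]))

# Their predicted probabilities become the features of a meta-classifier
def meta_features(models, X):
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])

meta = LogisticRegression().fit(meta_features(base_models, X_train), y_train)
y_pred = meta.predict(meta_features(base_models, X_test))
```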

This is far from an exhaustive list of solutions. Working with imbalanced (or 'skewed') datasets could fill an entire textbook. I would recommend reading some papers on this topic. Perhaps starting here.
