Classification: skewed data within a class


Problem description

I'm trying to build a multi-label classifier to predict the probability of some input data being either 0 or 1. I'm using a neural network with TensorFlow + Keras (maybe a CNN later).

The problem is the following: the data is highly skewed. There are far more negative examples than positive ones, perhaps 90:10. As a result, my neural network almost always outputs very low probabilities for positive examples; thresholded to binary values, it predicts 0 in most cases.

The performance is > 95% for nearly all classes, but this is only because the model nearly always predicts zero. Consequently, the number of false negatives is very high.

Any suggestions on how to fix this?

Here are the ideas I have considered so far:

  1. Penalizing false negatives more heavily with a customized loss function (my first attempt failed). This is similar to class weighting, but applied within a class: weighting positive examples more than negative ones. How would you implement this in Keras?

  2. Oversampling positive examples by cloning them, and then fitting the neural network on the enlarged set so that positive and negative examples are balanced.
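For the first idea, one common approach is a weighted binary cross-entropy. The sketch below shows the math in plain numpy so it is easy to verify; the `pos_weight` value of 9 is an assumption derived from the stated 90:10 ratio, and the same expression written with `tf` ops (and wrapped in `tf.reduce_mean`) can be passed as `loss=` to `model.compile` in Keras.

```python
import numpy as np

def weighted_bce(y_true, y_pred, pos_weight=9.0, eps=1e-7):
    """Binary cross-entropy that penalizes false negatives more.

    pos_weight > 1 up-weights the positive (rare) class; with a
    90:10 imbalance, pos_weight = 90 / 10 = 9 is a natural starting
    point. Written with tf ops instead of numpy, this works as a
    custom Keras loss.
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    loss = -(pos_weight * y_true * np.log(y_pred)
             + (1.0 - y_true) * np.log(1.0 - y_pred))
    return loss.mean()

# A confident miss on a positive label now costs far more than the
# same-sized miss on a negative label:
fn = weighted_bce(np.array([1.0]), np.array([0.1]))  # false negative
fp = weighted_bce(np.array([0.0]), np.array([0.9]))  # false positive
```

TensorFlow also ships `tf.nn.weighted_cross_entropy_with_logits`, which applies the same idea directly on logits, and `model.fit` accepts a `class_weight` dict for the simpler per-class variant.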

Thanks in advance!

Recommended answer

You're on the right track.

Usually, you would balance your data set before training: either reduce the over-represented class, or generate artificial (augmented) data for the under-represented class to boost its occurrence.

  1. Reduce the over-represented class
This one is simpler: randomly pick as many samples as there are in the under-represented class, discard the rest, and train on the new subset. The disadvantage, of course, is that you lose some learning potential, depending on how complex your task is (how many features it has).
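Random undersampling of the majority class can be sketched in a few lines; the function name and the toy 80:20 data below are illustrative, not from the original question.

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly drop negatives until they match the positive count.

    Assumes binary labels y in {0, 1}. Illustrative sketch; for real
    projects, libraries such as imbalanced-learn offer the same idea
    with more options.
    """
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # keep all positives, and an equal-sized random subset of negatives
    keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    keep = np.concatenate([pos_idx, keep_neg])
    rng.shuffle(keep)
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 80:20 imbalance
Xb, yb = undersample(X, y)                     # now balanced 50:50
```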

  2. Augment data
Depending on the kind of data you're working with, you can "augment" it: take existing samples, modify them slightly, and use them as additional samples. This works very well for image and sound data. You could flip/rotate, scale, add noise, increase/decrease brightness, crop, etc. The important thing is to stay within the bounds of what could happen in the real world. For example, if you want to recognize a "70 mph speed limit" sign, flipping it makes no sense: you will never encounter an actual mirrored 70 mph sign. If you want to recognize a flower, flipping or rotating it is permissible. The same goes for sound: slightly changing volume/frequency won't matter much, but reversing the audio track changes its "meaning", and you won't have to recognize backwards-spoken words in the real world.
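As a minimal sketch of label-preserving image augmentation (assuming, as discussed above, that a mirrored image is still a plausible real-world sample), one can simply add horizontally flipped copies of the positive examples; the tiny 4x4 "images" here are synthetic placeholders. Keras offers the same transformations out of the box via `ImageDataGenerator(horizontal_flip=True)`.

```python
import numpy as np

def augment_flip(images, labels):
    """Double a set of images by appending horizontally flipped copies.

    images: array of shape (n, height, width). Only valid when a
    mirrored image is still realistic (flowers: yes, speed-limit
    signs: no).
    """
    flipped = images[:, :, ::-1]  # reverse the width axis
    return (np.concatenate([images, flipped]),
            np.concatenate([labels, labels]))

# Two toy "images" with a bright left column; flipping moves it right.
imgs = np.zeros((2, 4, 4))
imgs[:, :, 0] = 1.0
aug_imgs, aug_labels = augment_flip(imgs, np.array([1, 1]))
```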

If you have to augment tabular data (sales data, metadata, etc.), it gets much trickier, as you have to be careful not to implicitly feed your own assumptions into the model.

