Neural Network for Imbalanced Multi-Class Multi-Label Classification


Problem Description

How do you deal with multi-label classification that has imbalanced results while training neural networks? One of the solutions that I came across was to penalize the error for rarely labeled classes. Here is how I designed the network:

Number of classes: 100. The input layer, 1st hidden layer, and 2nd hidden layer (100 units) are fully connected with dropout and ReLU. The output of the 2nd hidden layer is py_x.

cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=py_x, labels=Y))

Where Y is a modified version of the one-hot encoding, with values between 1 and 5 set for all the labels of a sample. The value is ~1 for the most frequent label and ~5 for the rarest labels. The values are not discrete; i.e., the new value to be set for a label in the one-hot encoding is based on the formula

new value = 1 + 4 * (1 - (percentage of label / 100))

For example, <0, 0, 1, 0, 1, .... > would be converted to something like <0, 0, 1.034, 0, 3.667, ...>. NOTE: only the values of 1 in the original vectors are changed.
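The target-weighting step above can be sketched in a few lines of NumPy. This is a minimal illustration of the question's formula; the label percentages below are made-up numbers, not the asker's data:

```python
import numpy as np

# Hypothetical per-label frequencies (percent of samples carrying each label).
# These numbers are illustrative only.
label_percentages = np.array([50.0, 10.0, 1.0])

# The question's formula: weight = 1 + 4 * (1 - percentage / 100),
# so a label appearing in ~100% of samples gets weight ~1 and a very
# rare label gets weight ~5.
label_weights = 1.0 + 4.0 * (1.0 - label_percentages / 100.0)

def weight_targets(y_multihot, weights):
    """Replace each 1 in a multi-hot target vector with its label's weight.

    Zeros stay zero, matching the note that only the 1-values change.
    """
    return y_multihot * weights

# A sample carrying the 2nd and 3rd labels:
y = np.array([0.0, 1.0, 1.0])
weighted_y = weight_targets(y, label_weights)
```

Because the weighted vector is just an elementwise product, entries for absent labels remain exactly zero, and only present labels are scaled.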

This way, if the model incorrectly predicts a rare label, its error is high, e.g. 0.0001 - 5 = -4.9999, and this back-propagates a heavier error compared to mislabeling a very frequent label.

Is this the right way to penalize? Are there any better methods to deal with this problem?

Recommended Answer

Let's answer your problem in its general form. What you are facing is the class imbalance problem, and there are many ways to tackle it. Common approaches are:

  1. Dataset Resampling: Make the classes balanced by changing the dataset size.
    For example, if you have 5 target classes (classes A to E), and classes A, B, C, and D have 1000 examples each while class E has 10 examples, you can simply add 990 more examples to class E (just copy them, or copy them and add some noise).
  2. Cost-Sensitive Modeling: Change the importance (weight) of different classes.
    This is the method you have used in your code, where you increased the importance (weight) of a class by a factor of at most 5.
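The resampling option (point 1) can be sketched as a small oversampling helper. This is a generic illustration for single-label class vectors like the A-to-E example, not the asker's multi-label data; the function name and array shapes are assumptions:

```python
import numpy as np

def oversample(X, y, noise_std=0.0, seed=0):
    """Duplicate minority-class rows (optionally with Gaussian noise)
    until every class has as many examples as the largest class.

    X: (n_samples, n_features) feature matrix.
    y: (n_samples,) integer class labels.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        deficit = target - count
        if deficit == 0:
            continue
        # Sample existing rows of this class with replacement.
        idx = rng.choice(np.where(y == cls)[0], size=deficit, replace=True)
        # noise_std=0 gives exact copies; a small positive value jitters them.
        X_new = X[idx] + rng.normal(0.0, noise_std, size=X[idx].shape)
        X_parts.append(X_new)
        y_parts.append(np.full(deficit, cls))
    return np.concatenate(X_parts), np.concatenate(y_parts)
```

For example, oversampling a set with 5 examples of class 0 and 2 of class 1 yields 5 of each. The alternative direction, undersampling the majority class, works the same way but discards data instead of duplicating it.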

Returning to your problem, the first solution is independent of your model. You just need to check whether you are able to change the dataset (add more samples to classes with fewer samples, or remove samples from classes with many samples). For the second solution, since you are working with a neural network, you have to change your loss function formula. You can define multiple hyperparameters (class weights or importances), train your model, and see which set of parameters works better.
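One common way to move the class weights into the loss function itself, rather than into the targets, is a per-class weighted cross-entropy. The sketch below is a NumPy stand-in that mirrors the numerically stable formula used by `tf.nn.sigmoid_cross_entropy_with_logits`, with an added per-class weight vector as the tunable hyperparameter; it is an assumption-laden illustration, not the asker's code:

```python
import numpy as np

def weighted_sigmoid_xent(logits, targets, class_weights):
    """Sigmoid cross-entropy with per-class weights.

    Uses the numerically stable form
        max(x, 0) - x * z + log(1 + exp(-|x|))
    (the same expression TF's sigmoid_cross_entropy_with_logits uses),
    then scales each label's loss by its class weight before averaging.
    """
    per_label = (np.maximum(logits, 0.0)
                 - logits * targets
                 + np.log1p(np.exp(-np.abs(logits))))
    return np.mean(per_label * class_weights)
```

With all weights equal to 1 this reduces to the ordinary sigmoid cross-entropy; raising the weight of a rare class scales that class's contribution to the gradient by the same factor, which is the "heavier back-propagated error" effect the question aims for.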

So to answer your question: yes, this is a valid way to penalize, but you may get better accuracy by trying different weights (instead of the 5 in your example). Also, you might want to try dataset resampling.

For more information, you can refer to this link.
