Neural Network for Imbalanced Multi-Class Multi-Label Classification


Problem Description

How to deal with multi-label classification which has imbalanced results while training neural networks? One of the solutions that I came across was to penalize the error for rarely labeled classes. Here is how I designed the network:

Number of classes: 100. The input layer, the 1st hidden layer, and the 2nd hidden layer (100) are fully connected with drop-outs and ReLU. The output of the 2nd hidden layer is py_x.

cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=py_x, labels=Y))

Where Y is a modified version of one-hot encoding, with values between 1 and 5 set for all the labels of a sample. The value would be ~1 for the most frequent label and ~5 for the rarest labels. The values are not discrete, i.e., the new value to be set for a label in the one-hot encoding is based on the formula

new value = 1 + 4 * (1 - (percentage of label / 100))

For example, <0, 0, 1, 0, 1, ...> would be converted to something like <0, 0, 1.034, 0, 3.667, ...>. NOTE: only the values of 1 in the original vectors are changed.
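The weighting scheme above can be sketched in NumPy. The helper names and the example percentages below are illustrative, not from the question (the percentages were chosen so the output reproduces the 1.034 and 3.667 values from the example):

```python
import numpy as np

def label_weight(percentage):
    # Formula from the question: 1 + 4*(1 - percentage/100).
    # A label occurring in ~100% of samples gets weight ~1;
    # a vanishingly rare label approaches weight 5.
    return 1.0 + 4.0 * (1.0 - percentage / 100.0)

def weighted_targets(y, percentages):
    # Replace each 1 in the multi-hot target with its label's weight;
    # zeros stay zero (only the 1s are changed, as noted above).
    return y * np.array([label_weight(p) for p in percentages])

# Percentages chosen to reproduce the example values 1.034 and 3.667.
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
pct = [90.0, 80.0, 99.15, 10.0, 33.325]
print(weighted_targets(y, pct))  # weights 1.034 and 3.667 at positions 2 and 4
```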

This way, if the model incorrectly predicts a rare label, its error would be high, e.g. 0.0001 - 5 = -4.9999, and this would back-propagate a heavier error compared to mislabeling a very frequent label.

Is this the right way to penalize? Are there any better methods to deal with this problem?

Answer

Let's answer your problem in the general form. What you are facing is the class imbalance problem, and there are many ways to tackle it. Common ways are:

  1. Dataset Resampling: Make the classes balanced by changing the dataset size.
    For example, if you have 5 target classes (class A to E), and classes A, B, C, and D have 1000 examples each while class E has 10 examples, you can simply add 990 more examples to class E (just duplicate them, or duplicate them and add some noise).
  2. Cost-Sensitive Modeling: Change the importance (weight) of different classes.
    This is the method you have used in your code, where you increased the importance (weight) of a class by a factor of at most 5.
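The resampling option (1) can be sketched as follows. The function name, the noise scale, and the toy data are illustrative, not part of the original answer:

```python
import numpy as np

rng = np.random.default_rng(0)

def oversample(X, y, target_count, noise_scale=0.01):
    # Naive oversampling: draw existing rows with replacement until the
    # class reaches target_count, adding small Gaussian noise to the copies.
    idx = rng.integers(0, len(X), size=target_count - len(X))
    X_extra = X[idx] + rng.normal(0.0, noise_scale, size=X[idx].shape)
    return np.vstack([X, X_extra]), np.concatenate([y, y[idx]])

# Class E from the example: 10 samples padded up to 1000.
X_e = rng.normal(size=(10, 4))   # 10 feature vectors of dimension 4
y_e = np.full(10, 4)             # label index for class E
X_bal, y_bal = oversample(X_e, y_e, 1000)
print(X_bal.shape, y_bal.shape)  # (1000, 4) (1000,)
```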

Returning to your problem, the first solution is independent of your model. You just need to check whether you are able to change the dataset (add more samples to classes with fewer samples, or remove samples from classes with many samples). For the second solution, since you are working with a neural network, you have to change your loss function formula. You can define multiple hyperparameters (class weights or importance), train your model, and see which set of parameters works better.
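A cost-sensitive loss of the kind described can be sketched in plain NumPy as a numerically stable sigmoid cross-entropy scaled by one tunable weight per class. The function and argument names here are illustrative, and the weight vector is the hyperparameter to be tuned:

```python
import numpy as np

def weighted_sigmoid_xent(logits, labels, class_weights):
    # Numerically stable sigmoid cross-entropy,
    #   max(x, 0) - x*z + log(1 + exp(-|x|)),
    # scaled per class by a tunable weight vector.
    xent = (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))
    return np.mean(xent * class_weights)

logits = np.zeros(3)                 # sigmoid(0) = 0.5 for every class
labels = np.array([1.0, 1.0, 1.0])
weights = np.array([1.0, 2.0, 3.0])  # rarer classes get larger weights
print(weighted_sigmoid_xent(logits, labels, weights))  # ≈ 1.3863 (= 2*log 2)
```

With all weights set to 1 this reduces to the ordinary sigmoid cross-entropy from the question; increasing a class's weight scales its contribution to the gradient accordingly.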

So to answer your question: yes, this is a right way to penalize, but you might get better accuracy by trying different weights (instead of 5 in your example). Also, you might want to try dataset resampling.

For more information, you can refer to this link.

