Imbalanced Dataset for Multi Label Classification


Question


So I trained a deep neural network on a multi-label dataset I created (about 20,000 samples). I switched softmax for sigmoid and tried to minimize (using the Adam optimizer):

tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=y_pred))
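For reference, the per-label loss that `tf.nn.sigmoid_cross_entropy_with_logits` computes can be sketched in plain Python; the numerically stable form `max(x, 0) - x*z + log(1 + exp(-|x|))` is the one TensorFlow documents (the labels and logits below are toy numbers, not from the actual model):

```python
import math

# Per-label loss behind tf.nn.sigmoid_cross_entropy_with_logits, in the
# numerically stable form max(x, 0) - x*z + log(1 + exp(-|x|)).
def sigmoid_cross_entropy_with_logits(label, logit):
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# tf.reduce_mean over the labels of one sample (toy numbers)
labels = [1.0, 0.0, 1.0]
logits = [2.0, -1.0, 0.5]
losses = [sigmoid_cross_entropy_with_logits(z, x) for z, x in zip(labels, logits)]
mean_loss = sum(losses) / len(losses)
```

Each label is treated as an independent binary problem, which is why sigmoid (and not softmax) is the right output for multi-label classification.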

And I end up with this kind of prediction (pretty "constant"):

Prediction for Im1 : [ 0.59275776  0.08751075  0.37567005  0.1636796   0.42361438  0.08701646 0.38991812  0.54468459  0.34593087  0.82790571]

Prediction for Im2 : [ 0.52609032  0.07885984  0.45780018  0.04995904  0.32828355  0.07349177 0.35400775  0.36479294  0.30002621  0.84438241]

Prediction for Im3 : [ 0.58714485  0.03258472  0.3349618   0.03199361  0.54665488  0.02271551 0.43719986  0.54638696  0.20344526  0.88144571]

At first, I thought I just needed to find a threshold value for each class.
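One common way to pick such per-class thresholds is to sweep candidate values on a held-out set and keep, for each class, the threshold that maximizes F1. A minimal sketch, where all the data is illustrative:

```python
# Per-class threshold selection by F1 sweep (toy, illustrative data).
def f1_at_threshold(y_true, y_prob, thr):
    tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= thr)
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= thr)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < thr)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, y_prob, candidates):
    return max(candidates, key=lambda thr: f1_at_threshold(y_true, y_prob, thr))

# run this once per class, on validation predictions for that class
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.7, 0.55, 0.6, 0.2]
thr = best_threshold(y_true, y_prob, [i / 20 for i in range(1, 20)])
```

This does not fix the imbalance itself, but it decouples the decision rule from the skewed output scale.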

But I noticed that, for instance, among my 20,000 samples the 1st class appears about 10,800 times, so a 0.54 ratio, and that is the value around which my prediction sits every time. So I think I need to find a way to tackle this "imbalanced dataset" issue.

I thought about reducing my dataset (undersampling) to have about the same number of occurrences for each class, but only 26 samples correspond to one of my classes... That would make me lose a lot of samples...
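For multi-label data such undersampling is usually done greedily, since a sample can carry several labels at once. A hedged sketch, assuming the dataset is a list of (features, label_vector) pairs and the cap is illustrative:

```python
import random

# Greedy multi-label undersampling: keep a sample only while none of its
# active labels has reached the per-class cap. Layout and cap are assumptions.
def undersample(dataset, num_classes, cap, seed=0):
    rng = random.Random(seed)
    shuffled = dataset[:]
    rng.shuffle(shuffled)
    counts = [0] * num_classes
    kept = []
    for features, labels in shuffled:
        if all(counts[c] < cap for c in range(num_classes) if labels[c] == 1):
            kept.append((features, labels))
            for c in range(num_classes):
                counts[c] += labels[c]
    return kept
```

With a rare class of only 26 samples the cap would have to be tiny, which is exactly the loss of data described above.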

I read about oversampling, or about penalizing even more the classes that are rare, but did not really understand how that works.
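The "penalizing" idea amounts to scaling the positive term of the per-label cross-entropy by pos_weight ≈ (#negatives / #positives) for that class, so that a rare label contributes as much to the loss as a frequent one; TensorFlow ships this as `tf.nn.weighted_cross_entropy_with_logits`. A plain-Python sketch with illustrative numbers:

```python
import math

# Weighted binary cross-entropy: the positive term is scaled by pos_weight,
# mirroring tf.nn.weighted_cross_entropy_with_logits (numbers illustrative).
def weighted_bce(label, logit, pos_weight):
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1 - p))

# for the class seen in 26 of 20000 samples:
pos_weight_rare = (20000 - 26) / 26  # ~768: a missed positive costs ~768x more
```

Oversampling reaches the same goal from the data side, by replicating (or augmenting) the rare-class samples until each label is seen often enough per epoch.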

Can someone share some explanations about these methods, please?

In practice, are there TensorFlow functions that help with doing that?

Any other suggestions ?

Thank you :)

PS: the post Neural Network for Imbalanced Multi-Class Multi-Label Classification raises the same problem but has no answer!

Solution

Well, having 10,000 samples in one class and just 26 in a rare class will indeed be a problem.

However, what you experience, to me, seems more like "the outputs don't even see the inputs", and thus the net just learns your output distribution.
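A quick check for this hypothesis: if the mean predicted probability per class is close to that class's frequency in the training labels, the network is likely ignoring its inputs. A toy sketch (all numbers illustrative):

```python
# Diagnostic: does the model merely reproduce the label prior per class?
def mean_per_class(rows):
    n = len(rows)
    return [sum(r[c] for r in rows) / n for c in range(len(rows[0]))]

train_labels = [[1, 0], [1, 1], [0, 0], [1, 0]]   # class frequencies 0.75, 0.25
predictions  = [[0.74, 0.26], [0.76, 0.24], [0.75, 0.25], [0.73, 0.26]]

label_freq = mean_per_class(train_labels)
pred_mean  = mean_per_class(predictions)
input_blind = all(abs(f - p) < 0.05 for f, p in zip(label_freq, pred_mean))
```

The 0.54 the asker observes for the first class is exactly this pattern: the prediction hovers at the class prior.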

To debug this I would create a reduced set (just for this debugging purpose) with, say, 26 samples per class, and then try to heavily overfit. If you get correct predictions, my thought is wrong. But if the net cannot even detect those undersampled overfit samples, then it is indeed an architecture/implementation problem and not due to the skewed distribution (which you will then need to fix, but it will not be as bad as your current results).
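The debugging set described above could be sketched as follows, assuming the dataset is a list of (features, label_vector) pairs (that layout is an assumption for illustration):

```python
# Build a tiny debug set with up to `per_class` samples per class, then try
# to overfit it; a healthy architecture should reach ~100% on such a set.
def debug_subset(dataset, num_classes, per_class=26):
    picked = set()
    for c in range(num_classes):
        idxs = [i for i, (_, labels) in enumerate(dataset) if labels[c] == 1]
        picked.update(idxs[:per_class])
    return [dataset[i] for i in sorted(picked)]
```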
