Rule of thumb for negative examples in multi-class classification


Question

Is there a rule of thumb for how big the number of samples should be for the label that represents "everything else" in a multi-class classification task?

Example: I want to classify my input as being one of X classes. The X + 1 class activates when the input is "none of the above." Suppose my dataset contains 5,000 samples from each of the 10 "positive" classes. For samples representing the "unknown" class, I'd use multiple realistic examples likely to be found in production, but that are not from the other classes.

How big should the number of these negative examples be relative to the other distributions?

Answer

This is maybe a bit off-topic, but in any case, I don't think there is a general rule of thumb; it depends on your problem and your approach.

I would consider the following factors:

  • The nature of the data. This is a bit abstract, but you can ask yourself whether you would expect samples from the "everything else" class to be easily confused with an actual class. For example, if you want to detect dogs or cats in general images of animals, there are probably many other animals (e.g. foxes) that may confuse the system, but if your input only has images of dogs, cats or furniture, maybe not so much. This is however an intuition only, and in other problems it may not be so clear.
  • Your model. For example, in this answer I gave to a related question I mention an approach to model "everything else" as a function of the rest of the classes, so you could argue that, if inputs are not too similar (previous point), it might just work even with no examples of "everything else", since none of the other classes are triggered. Other tricks, like giving different training "weights" to each class (e.g. computed as a function of the number of instances you have of each one), may compensate for an unbalanced dataset.
  • Your goals. Obviously you want your system to be perfect, but you may consider whether you'd rather have false positives or false negatives (e.g. is it worse to miss an image of a dog or to say there's a dog when there's none). If you expect your input to be mostly composed of instances of "everything else", it may make sense that your model is biased towards that class, or maybe for that very reason you want to be sure you don't discard any potentially interesting sample.
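The per-class weighting trick mentioned above can be sketched as follows. This is a minimal illustration, assuming the dataset from the question (10 "positive" classes with 5,000 samples each) plus a hypothetical smaller "everything else" class of 1,000 samples; the class names and counts are made up, and the formula mirrors the "balanced" heuristic used by scikit-learn's `class_weight="balanced"` option:

```python
# Sketch: per-class training weights inversely proportional to class
# frequency, one way to compensate for an unbalanced "everything else"
# class. Counts below are hypothetical, not from the question's data.
from collections import Counter

# 10 "positive" classes with 5,000 samples each, plus a smaller
# "other" (everything else) class with 1,000 samples.
labels = ["class_%d" % i for i in range(10)] * 5000 + ["other"] * 1000

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Balanced weighting: n_samples / (n_classes * count_for_class).
# Rare classes get proportionally larger weights.
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

for c in sorted(weights):
    print(c, round(weights[c], 3))
```

With these counts, each positive class gets a weight just below 1 while the under-represented "other" class gets a weight several times larger, so the loss is not dominated by the majority classes.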

Unfortunately, the only good way of telling whether you are doing OK is to experiment and obtain good metrics over a representative test dataset (confusion matrix, per-class precision/recall, etc.).
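As a concrete illustration of that evaluation step, here is a small self-contained sketch that builds a confusion matrix and per-class precision/recall over a toy test set. The labels and predictions are invented for the example; in practice scikit-learn's `confusion_matrix` and `classification_report` compute the same quantities:

```python
# Sketch: confusion matrix and per-class precision/recall over a toy
# test set with an "other" (everything else) class. Data is made up.
from collections import defaultdict

y_true = ["dog", "dog", "cat", "other", "other", "cat", "dog", "other"]
y_pred = ["dog", "cat", "cat", "other", "dog",   "cat", "dog", "other"]

classes = sorted(set(y_true) | set(y_pred))
confusion = defaultdict(int)  # (true label, predicted label) -> count
for t, p in zip(y_true, y_pred):
    confusion[(t, p)] += 1

for c in classes:
    tp = confusion[(c, c)]
    predicted = sum(confusion[(t, c)] for t in classes)  # column sum
    actual = sum(confusion[(c, p)] for p in classes)     # row sum
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    print(f"{c}: precision={precision:.2f} recall={recall:.2f}")
```

Looking at the "other" row and column separately is exactly what tells you whether the negative class is absorbing too many (low precision on the real classes) or too few (low recall on "other") of the inputs.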
