Rule of thumb for negative examples in multi-class classification

Question

Is there a rule of thumb for how big the number of samples should be for the label that represents "everything else" in a multi-class classification task?

Example: I want to classify my input as being one of X classes. Class X + 1 activates when the input is "none of the above." Suppose my dataset contains 5,000 samples from each of the 10 "positive" classes. For samples representing the "unknown" class, I'd use a variety of realistic examples likely to be found in production that do not belong to any of the other classes.

How big should the number of these negative examples be relative to the distributions of the other classes?

Answer

This is maybe a bit off-topic, but in any case, I don't think there is a general rule of thumb; it depends on your problem and your approach.

I would take the following factors into account:

  • The nature of the data. This is a bit abstract, but you can ask yourself whether you would expect samples from the "everything else" class to be easily confused with an actual class. For example, if you want to detect dogs or cats in general images of animals, there are probably many other animals (e.g. foxes) that may confuse the system, but if your input only has images of dogs, cats or furniture, maybe not so much. This is, however, only an intuition, and in other problems it may not be so clear-cut.
  • Your model. For example, in an answer I gave to a related question I mention an approach that models "everything else" as a function of the rest of the classes, so you could argue that, if inputs are not too similar (previous point), it might just work even with no examples of "everything else", since none of the other classes are triggered. Other tricks, like giving different training "weights" to each class (e.g. computed as a function of the number of instances you have of each class; see the sketch after this list), may compensate for an imbalanced dataset.
  • Your goals. Obviously you want your system to be perfect, but you may consider whether you'd rather have false positives or false negatives (e.g. is it worse to miss an image of a dog, or to say there is a dog when there is none?). If you expect your input to be mostly composed of "everything else" instances, it may make sense for your model to be biased towards that class; or, for that very reason, you may want to make sure you don't discard any potentially interesting sample.
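To make the class-weighting idea concrete, here is a minimal sketch of the inverse-frequency heuristic. The class sizes are hypothetical, matching the example in the question plus an assumed 2,000 "everything else" samples:

```python
import numpy as np

# Hypothetical labels: 10 "positive" classes (0-9) with 5,000 samples each,
# plus class 10 ("everything else") with an assumed 2,000 samples.
y = np.concatenate([np.full(5000, c) for c in range(10)] + [np.full(2000, 10)])

# Inverse-frequency weights (the heuristic scikit-learn calls "balanced"):
# n_samples / (n_classes * n_samples_per_class). Rarer classes get larger
# weights, so the training loss is not dominated by the majority classes.
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)

for c, n, w in zip(classes, counts, weights):
    print(f"class {c}: {n} samples, weight {w:.3f}")
```

Feeding these weights into the classifier's loss (most libraries accept some form of per-class weight) is one way to keep a large "everything else" set from drowning out the positive classes, or vice versa.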

Unfortunately, the only good way of telling whether you are doing OK is to experiment and get good metrics on a representative test dataset (confusion matrix, per-class precision/recall, etc.).
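For instance, a minimal sketch of those diagnostics with scikit-learn; the labels and predictions below are made up for illustration, with class 3 playing the role of "everything else":

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical ground truth and predictions on a held-out test set.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 3, 3])
y_pred = np.array([0, 3, 1, 1, 2, 0, 3, 3, 1, 3])

# Rows are true classes, columns are predicted classes. Off-diagonal mass in
# the last row/column shows confusion with the "everything else" class.
print(confusion_matrix(y_true, y_pred))

# Per-class precision/recall reveals whether "everything else" is absorbing
# samples from the real classes (low recall on those classes) or vice versa.
print(classification_report(y_true, y_pred,
                            target_names=["cat", "dog", "fox", "other"]))
```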
