Limit neural network output to subset of trained classes

Problem description

Is it possible to pass a vector to a trained neural network so it only chooses from a subset of the classes it was trained to recognize. For example, I have a network trained to recognize numbers and letters, but I know that the images I'm running it on next will not contain lowercase letters (Such as images of serial numbers). Then I pass it a vector telling it not to guess any lowercase letters. Since the classes are exclusive the network ends in a softmax function. Following are just examples of what I'd thought of trying but none really work.

import numpy as np

def softmax(arr):
    return np.exp(arr)/np.exp(arr).sum()

# Stand-ins for the previous layer/NN output and the vector of allowed answers.
output = np.array([ 0.15885351,0.94527385,0.33977026,-0.27237907,0.32012873,
       0.44839673,-0.52375875,-0.99423903,-0.06391236,0.82529586])
restrictions = np.array([1,1,0,0,1,1,1,0,1,1])

#Ideas -----

'''First: Multiply by the restrictions before sending it through softmax.
I stupidly tried this one.'''
results = softmax(output*restrictions)

'''Second: Multiply the results of the softmax by the restrictions.'''
results = softmax(output)
results = results*restrictions

'''Third: Remove invalid entries before calculating the softmax.'''
result = output*restrictions
result[result != 0] = softmax(result[result != 0])

All of these have issues. The first one causes invalid choices to default to:

1/np.exp(arr).sum()

Since inputs to softmax can be negative, this can raise the probability given to an invalid choice and make the answer worse. (I should have looked into it before trying it.)
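
To make that concrete with the stand-in values above: class 7 has a negative logit (about -0.994), so multiplying by the restriction vector raises its logit to 0 and its probability actually goes up. A quick illustrative check:

import numpy as np

def softmax(arr):
    return np.exp(arr)/np.exp(arr).sum()

output = np.array([0.15885351, 0.94527385, 0.33977026, -0.27237907, 0.32012873,
                   0.44839673, -0.52375875, -0.99423903, -0.06391236, 0.82529586])
restrictions = np.array([1, 1, 0, 0, 1, 1, 1, 0, 1, 1])

unrestricted = softmax(output)           # no restriction applied
zeroed = softmax(output*restrictions)    # first idea: zero the logits before softmax

print(unrestricted[7])  # ~0.028
print(zeroed[7])        # ~0.074 -- higher, even though class 7 is supposed to be disallowed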

The second and third both have a similar issue in that they wait until right before an answer is given to apply the restriction. For example, if the network is looking at the letter l but starts to determine that it's the number 1, this won't be corrected until the very end with these methods. So if it was on its way to giving the output 1 with .80 probability but that option is then removed, it seems the remaining options get redistributed and the highest valid answer won't be as confident as 80%. The remaining options end up a lot more homogeneous. An example of what I'm trying to say:

output
Out[75]: array([ 5.39413513,  3.81445419,  3.75369546,  1.02716988,  0.39189373])

softmax(output)
Out[76]: array([ 0.70454877,  0.14516581,  0.13660832,  0.00894051,  0.00473658])

softmax(output[1:])
Out[77]: array([ 0.49133596,  0.46237183,  0.03026052,  0.01603169])

(The arrays were ordered to make it easier.) In the original output the softmax gives .70 that the answer is [1,0,0,0,0], but if that's an invalid answer and thus removed, the redistribution leaves the 4 remaining options under 50% probability, which could easily be ignored as too low to use.

I've considered passing a vector into the network earlier as another input, but I'm not sure how to do this without requiring it to learn what the vector is telling it to do, which I think would increase the time required to train.

I was writing way too much in the comments so I'll just post updates here. I did eventually try giving the restrictions as an input to the network. I took the one-hot encoded answer and randomly added extra enabled classes to simulate an answer key, ensuring the correct answer was always in the key. When the key had very few enabled categories the network relied heavily on it, and it interfered with learning features from the image. When the key had a lot of enabled categories it seemingly ignored the key completely. This could have been a problem that needed optimization, an issue with my network architecture, or just something that needed a tweak to training, but I never got around to solving it.
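
A minimal sketch of that key generation (the helper name and the number of extra classes are just illustrative):

import numpy as np

def make_answer_key(true_label, num_classes, num_extra, rng):
    '''Binary restriction key: always contains the correct class,
    plus a random set of additional enabled classes.'''
    key = np.zeros(num_classes, dtype=np.float32)
    key[true_label] = 1.0
    extras = rng.choice(num_classes, size=num_extra, replace=False)
    key[extras] = 1.0  # the true label may be drawn again; it simply stays enabled
    return key

rng = np.random.default_rng(0)
print(make_answer_key(true_label=3, num_classes=10, num_extra=4, rng=rng))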

I did find out that removing answers and zeroing out were almost the same when I subtracted np.inf instead of multiplying by 0. I was aware of ensembles, but as mentioned in a comment on the first response, my network was dealing with CJK characters (the alphabet was just to make the example easier) and had 3000+ classes. The network was already overly bulky, which is why I wanted to look into this method. Using binary networks for each individual category was something I hadn't thought of, but 3000+ networks seems problematic too (if I understood what you were saying correctly), though I may look into it later.
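
For reference, a minimal numpy sketch of that -inf masking (the function name is just illustrative): disallowed classes get probability exactly 0, and the allowed ones match a softmax computed over the allowed logits alone.

import numpy as np

def masked_softmax(logits, allowed):
    '''Softmax restricted to the allowed classes; disallowed classes get probability 0.'''
    masked = np.where(allowed.astype(bool), logits, -np.inf)
    masked = masked - masked.max()   # subtract the max for numerical stability
    exps = np.exp(masked)            # exp(-inf) == 0, so disallowed entries vanish
    return exps/exps.sum()

logits = np.array([5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
allowed = np.array([0, 1, 1, 1, 1])  # disallow the first class
print(masked_softmax(logits, allowed))
# [0.         0.49133596 0.46237183 0.03026052 0.01603169], same as softmax(output[1:]) above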

Recommended answer

First of all, I will loosely go through the available options you have listed and add some viable alternatives with their pros and cons. It's kinda hard to structure this answer but I hope you'll get what I'm trying to put out:

Obviously, this may give a higher chance to the zeroed-out entries, as you have written, and it seems like a flawed approach from the beginning.

Alternative: replace the impossible values with the smallest logit value. This is similar to softmax(output[1:]), though the network will be even more uncertain about the results. Example pytorch implementation:

import torch

logits = torch.Tensor([5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
minimum, _ = torch.min(logits, dim=0)
logits[0] = minimum  # replace the disallowed class's logit with the smallest one
print(torch.nn.functional.softmax(logits, dim=0))

Which produces:

tensor([0.0158, 0.4836, 0.4551, 0.0298, 0.0158])

Discussion

  • Citing you: "In the original output the softmax gives .70 that the answer is [1,0,0,0,0], but if that's an invalid answer and thus removed, the redistribution leaves the 4 remaining options under 50% probability, which could easily be ignored as too low to use."

    Yes, and you would be in the right when doing that. Even more so, the actual probabilities for this class are far lower, around 14% (tensor([0.7045, 0.1452, 0.1366, 0.0089, 0.0047])). By manually changing the output you are essentially destroying the properties this NN has learned (and its output distribution), rendering part of your computations pointless. This points to another problem, stated in the bounty this time.

I can imagine this being solved in multiple ways:

2.1 Ensemble of networks

Create multiple neural networks and ensemble them by summing the logits and taking argmax at the end (or softmax and then argmax). A hypothetical situation with 3 different models giving different predictions:

import torch

predicted_logits_1 = torch.Tensor([5.39413513, 3.81419, 3.7546, 1.02716988, 0.39189373])
predicted_logits_2 = torch.Tensor([3.357895, 4.0165, 4.569546, 0.02716988, -0.189373])
predicted_logits_3 = torch.Tensor([2.989513, 5.814459, 3.55369546, 3.06988, -5.89473])

# Sum the logits across the ensemble and normalize once at the end.
combined_logits = predicted_logits_1 + predicted_logits_2 + predicted_logits_3
print(combined_logits)
print(torch.nn.functional.softmax(combined_logits, dim=0))

This would give us the following probabilities after softmax:

[0.11291057 0.7576356 0.1293983 0.00005554 0.]

(Notice that class 1 is now the most probable.)

You can use bootstrap aggregating and other ensembling techniques to improve predictions. This approach makes the classification decision surface smoother and fixes mutual errors between classifiers (given that their predictions vary quite a lot). It would take many posts to describe in any greater detail (or a separate question with a specific problem would be needed); here or here are some that might get you started.
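
For illustration only, a rough sketch of what bootstrap aggregating could look like here; make_model and train_fn are placeholders for whatever architecture and training loop you already have, not real APIs:

import torch

def train_bagged_ensemble(inputs, targets, num_models, make_model, train_fn):
    '''Train each model on its own bootstrap sample (drawn with replacement).'''
    models = []
    n = inputs.shape[0]
    for _ in range(num_models):
        idx = torch.randint(0, n, (n,))             # bootstrap: sample indices with replacement
        model = make_model()                        # placeholder: build a fresh model
        train_fn(model, inputs[idx], targets[idx])  # placeholder: your usual training loop
        models.append(model)
    return models

def ensemble_predict(models, x):
    '''Sum logits across the ensemble and take the argmax, as described above.'''
    with torch.no_grad():
        combined = torch.stack([m(x) for m in models]).sum(dim=0)
    return torch.argmax(combined, dim=-1)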

Still, I would not mix this approach with manually choosing the outputs.

2.2 One binary network per class

This approach might yield better inference time, and maybe even better training time, if you can distribute it over multiple GPUs.

Basically, each of your classes can either be present (1) or absent (0). In principle you could train N neural networks for N classes, each outputting a single unbounded number (logit). This single number tells you whether the network thinks this example should be classified as its class or not.

If you are sure a certain class won't be the outcome, you simply do not run the network responsible for detecting that class. After obtaining predictions from all the networks (or a subset of them), you choose the highest value (or the highest probability if you use a sigmoid activation, though that would be computationally wasteful).

An additional benefit would be the simplicity of said networks (easier training and fine-tuning) and easy switch-like behavior when needed (sketched below).
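
A minimal sketch of that switch-like behavior (binary_nets, its contents, and the single-logit output are assumptions for illustration):

import torch

def predict_from_allowed(binary_nets, x, allowed_classes):
    '''Run only the binary networks for the allowed classes and return the class
    whose network produced the largest logit.'''
    scores = {}
    with torch.no_grad():
        for cls in allowed_classes:
            scores[cls] = binary_nets[cls](x).item()  # assumes one example, one unbounded logit
    return max(scores, key=scores.get)

# Usage sketch: binary_nets maps class index -> a small one-output model, e.g.
# predicted = predict_from_allowed(binary_nets, image, allowed_classes=[0, 1, 4, 5])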

If I were you I would go with the approach outlined in 2.2 (one binary network per class), as you could easily save yourself some inference time and it would allow you to "choose outputs" in a sensible manner.

If this approach is not enough, you may consider N ensembles of networks, so a mix of 2.2 and 2.1, with some bootstrap or other ensembling techniques. This should improve your accuracy as well.
