Limit neural network output to subset of trained classes


Problem Description

Is it possible to pass a vector to a trained neural network so that it only chooses from a subset of the classes it was trained to recognize? For example, I have a network trained to recognize numbers and letters, but I know that the images I'm running it on next will not contain lowercase letters (such as images of serial numbers). I then pass it a vector telling it not to guess any lowercase letters. Since the classes are mutually exclusive, the network ends in a softmax function. The following are just examples of what I thought of trying, but none really work.

import numpy as np

def softmax(arr):
    return np.exp(arr)/np.exp(arr).sum()

#Stand-ins for the previous layer/NN output and the vector of allowed answers.
output = np.array([ 0.15885351,0.94527385,0.33977026,-0.27237907,0.32012873,
       0.44839673,-0.52375875,-0.99423903,-0.06391236,0.82529586])
restrictions = np.array([1,1,0,0,1,1,1,0,1,1])

#Ideas -----

'''First: Multiply by restrictions before sending it through softmax.
I stupidly tried this one.'''
results = softmax(output*restrictions)

'''Second: Multiply the results of the softmax by the restrictions.'''
results = softmax(output)
results = results*restrictions

'''Third: Remove invalid entries before calculating the softmax.'''
result = output*restrictions
result[result != 0] = softmax(result[result != 0])

All of these have issues. The first one causes invalid choices to default to:

1/np.exp(arr).sum()

Since inputs to the softmax can be negative, this can raise the probability given to an invalid choice and make the answer worse. (I should've looked into it before trying it.)
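
To make this concrete, here is the first idea run on the example arrays above; the "disabled" classes (indices 2, 3 and 7) end up with more probability than some of the allowed classes:

import numpy as np

def softmax(arr):
    return np.exp(arr)/np.exp(arr).sum()

output = np.array([ 0.15885351,0.94527385,0.33977026,-0.27237907,0.32012873,
       0.44839673,-0.52375875,-0.99423903,-0.06391236,0.82529586])
restrictions = np.array([1,1,0,0,1,1,1,0,1,1])

# Zeroed-out logits become exp(0) = 1, which is larger than exp(x) for any
# negative logit, so "disabled" classes 2, 3 and 7 outrank allowed classes 6 and 8.
print(softmax(output*restrictions))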

The second and third both have a similar issue in that they wait until right before an answer is given to apply the restriction. For example, if the network is looking at the letter l but starts to determine that it's the number 1, this won't be corrected until the very end with these methods. So if it was on its way to giving the output 1 with probability 0.80, but that option is then removed, it seems the remaining options will redistribute and the highest valid answer won't be as confident as 80%. The remaining options end up a lot more homogeneous. An example of what I'm trying to say:

output
Out[75]: array([ 5.39413513,  3.81445419,  3.75369546,  1.02716988,  0.39189373])

softmax(output)
Out[76]: array([ 0.70454877,  0.14516581,  0.13660832,  0.00894051,  0.00473658])

softmax(output[1:])
Out[77]: array([ 0.49133596,  0.46237183,  0.03026052,  0.01603169])

(The arrays were ordered to make it easier.) In the original output, the softmax gives 0.70 that the answer is [1,0,0,0,0], but if that's an invalid answer and thus removed, the redistribution assigns each of the 4 remaining options less than 50% probability, which could easily be ignored as too low to use.

I've considered passing a vector into the network earlier as another input, but I'm not sure how to do this without requiring it to learn what the vector is telling it to do, which I think would increase the time required to train.

I was writing way too much in the comments, so I'll just post updates here. I did eventually try giving the restrictions as an input to the network. I took the one-hot-encoded answer and randomly added extra enabled classes to simulate an answer key, ensuring the correct answer was always in the key. When the key had very few enabled categories, the network relied heavily on it, and it interfered with learning features from the image. When the key had a lot of enabled categories, it seemingly ignored the key completely. This could have been a problem that needed optimization, an issue with my network architecture, or just something that needed a tweak to training, but I never got around to the solution.
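
As a rough illustration, an answer key like the one described could be generated along these lines; the make_answer_key helper and its arguments are just placeholders, not the code actually used:

import numpy as np

def make_answer_key(true_label, num_classes, num_extra, rng):
    # One-hot answer with the correct class always enabled, plus a few
    # randomly enabled "distractor" classes to simulate a restriction key.
    key = np.zeros(num_classes, dtype=np.float32)
    key[true_label] = 1.0
    others = [c for c in range(num_classes) if c != true_label]
    key[rng.choice(others, size=num_extra, replace=False)] = 1.0
    return key

rng = np.random.default_rng(0)
print(make_answer_key(true_label=3, num_classes=10, num_extra=4, rng=rng))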

I did find out that removing answers and zeroing them out were almost the same when I eventually subtracted np.inf instead of multiplying by 0. I was aware of ensembles, but as mentioned in a comment to the first response, my network was dealing with CJK characters (the alphabet was just to make the example easier) and had 3000+ classes. The network was already overly bulky, which is why I wanted to look into this method. Using binary networks for each individual category was something I hadn't thought of, but 3000+ networks seems problematic too (if I understood what you were saying correctly), though I may look into it later.
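
For completeness, a small sketch of that -inf masking, reusing the example arrays from the question:

import numpy as np

def softmax(arr):
    return np.exp(arr)/np.exp(arr).sum()

output = np.array([ 0.15885351,0.94527385,0.33977026,-0.27237907,0.32012873,
       0.44839673,-0.52375875,-0.99423903,-0.06391236,0.82529586])
restrictions = np.array([1,1,0,0,1,1,1,0,1,1])

# exp(-inf) == 0, so disallowed classes get exactly zero probability and the
# remaining mass is renormalized over the allowed classes only -- the same
# result as dropping those entries before the softmax.
masked = np.where(restrictions == 1, output, -np.inf)
print(softmax(masked))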

Recommended Answer

First of all, I will loosely go through the available options you have listed and add some viable alternatives with their pros and cons. It's kind of hard to structure this answer, but I hope you'll get what I'm trying to put out:

Multiplying by the restrictions before the softmax obviously may give a higher chance to the zeroed-out entries, as you have written, and seems like a flawed approach from the beginning.

Alternative: replace the impossible values with the smallest logit value. This is similar to softmax(output[1:]), though the network will be even more uncertain about the results. An example PyTorch implementation:

import torch

logits = torch.Tensor([5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
# Treat class 0 as impossible: overwrite its logit with the smallest logit
# so it can no longer win, then renormalize with softmax.
minimum, _ = torch.min(logits, dim=0)
logits[0] = minimum
print(torch.nn.functional.softmax(logits, dim=0))

This yields:

tensor([0.0158, 0.4836, 0.4551, 0.0298, 0.0158])

Discussion

  • Quoting you: "In the original output the softmax gives .70 that the answer is [1,0,0,0,0] but if that's an invalid answer and thus removed, the redistribution assigns the 4 remaining options under 50% probability, which could easily be ignored as too low to use."

  • Yes, and you would be in the right when doing that. Even more so, the actual probabilities for this class are far lower, around 14% (tensor([0.7045, 0.1452, 0.1366, 0.0089, 0.0047])). By manually changing the output you are essentially destroying the properties this NN has learned (and its output distribution), rendering some part of your computations pointless. This points to another problem, stated in the bounty this time:

    I can imagine this being solved in multiple ways:

    Create multiple neural networks and ensemble them by summing the logits and taking argmax at the end (or applying softmax and then argmax). A hypothetical situation with 3 different models making different predictions:

    import torch

    # Logits from three hypothetical models for the same input.
    predicted_logits_1 = torch.Tensor([5.39413513, 3.81419, 3.7546, 1.02716988, 0.39189373])
    predicted_logits_2 = torch.Tensor([3.357895, 4.0165, 4.569546, 0.02716988, -0.189373])
    predicted_logits_3 = torch.Tensor([2.989513, 5.814459, 3.55369546, 3.06988, -5.89473])

    # Sum the logits first, then apply softmax (or argmax) to the combined result.
    combined_logits = predicted_logits_1 + predicted_logits_2 + predicted_logits_3
    print(combined_logits)
    print(torch.nn.functional.softmax(combined_logits, dim=0))

    This would give us the following probabilities after the softmax:

    [0.11291057 0.7576356 0.1293983 0.00005554 0.]

    (Notice that the second class is now the most probable.)

    You can use bootstrap aggregating and other ensembling techniques to improve the predictions. This approach makes the classification decision surface smoother and fixes mutual errors between the classifiers (given that their predictions vary quite a lot). It would take many posts to describe it in any greater detail (or a separate question with a specific problem would be needed); here or here are some resources that might get you started.
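
    As a rough sketch of the bootstrap-sampling part (the toy TensorDataset below just stands in for real training data), each ensemble member would train on its own loader like this one, and their logits would then be summed as shown above:

    import torch
    from torch.utils.data import DataLoader, RandomSampler, TensorDataset

    # Toy dataset standing in for the real training data.
    data = TensorDataset(torch.randn(1000, 32), torch.randint(0, 5, (1000,)))

    # A bootstrap sample: same size as the dataset, drawn with replacement,
    # so each ensemble member trains on a slightly different view of the data.
    sampler = RandomSampler(data, replacement=True, num_samples=len(data))
    loader = DataLoader(data, batch_size=64, sampler=sampler)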

    Still, I would not mix this approach with manually choosing the outputs.

    This approach might yield better inference time, and maybe even better training time, if you can distribute it over multiple GPUs.

    Basically, each of your classes can either be present (1) or absent (0). In principle you could train N neural networks for N classes, each outputting a single unbounded number (a logit). This single number tells you whether the network thinks the example should be classified as its class or not.

    If you are sure a certain class won't be the outcome, you simply do not run the network responsible for detecting that class. After obtaining predictions from all the networks (or a subset of them), you choose the highest value (or the highest probability if you use a sigmoid activation, though that would be computationally wasteful).
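
    A minimal sketch of that idea (the tiny linear "detector" architecture and the 32x32 input size are placeholders, not a recommendation):

    import torch
    import torch.nn as nn

    num_classes = 50  # stands in for the 3000+ real classes
    # One tiny binary "detector" per class, each emitting a single unbounded logit.
    detectors = [nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 1))
                 for _ in range(num_classes)]

    def predict(image, allowed_classes):
        # Only run the detectors for classes that can actually occur.
        with torch.no_grad():
            scores = {c: detectors[c](image).item() for c in allowed_classes}
        # The class whose detector produced the highest logit wins.
        return max(scores, key=scores.get)

    image = torch.randn(1, 1, 32, 32)
    print(predict(image, allowed_classes=[0, 5, 42]))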

    An additional benefit would be the simplicity of said networks (easier training and fine-tuning) and easy switch-like behavior if needed.

    If I were you, I would go with the approach outlined in 2.2, as you could easily save yourself some inference time and it would allow you to "choose outputs" in a sensible manner.

    If this approach is not enough, you may consider N ensembles of networks, so a mix of 2.2 and 2.1, some bootstrapping, or other ensembling techniques. This should improve your accuracy as well.

