Shall I apply softmax before cross entropy?


Question

The PyTorch tutorial (https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py) trains a convolutional neural network (CNN) on the CIFAR-10 dataset.

    import torch.nn as nn
    import torch.nn.functional as F

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(3, 6, 5)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(6, 16, 5)
            self.fc1 = nn.Linear(16 * 5 * 5, 120)
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = x.view(-1, 16 * 5 * 5)
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)  # raw logits: no softmax here
            return x

The network looks good, except that the very last layer, fc3, predicts the probabilities of belonging to the 10 classes without a softmax. Shouldn't we apply a softmax first, to make sure the output of the fc layer is between 0 and 1 and sums to 1, before calculating the cross-entropy loss?

I tested this by applying the softmax and rerunning, but the accuracy dropped to around 35%. This seems counterintuitive. What is the explanation?
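For reference, the change being tested presumably amounts to something like the following (a hypothetical edit to the tutorial's forward method, not the questioner's actual code):

    # Hypothetical modification being asked about: apply softmax to fc3's
    # output so the network returns probabilities instead of raw logits.
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.softmax(self.fc3(x), dim=1)  # extra softmax under test
        return x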

Answer

CrossEntropyLoss in PyTorch already applies the softmax internally:

https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss

This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class.
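A minimal sketch (batch size and class count are illustrative) confirming this equivalence: passing raw logits to cross_entropy gives the same loss as applying log_softmax yourself and calling nll_loss:

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    logits = torch.randn(4, 10)           # raw scores, e.g. the output of fc3
    targets = torch.tensor([1, 0, 4, 9])  # ground-truth class indices

    # CrossEntropyLoss expects raw logits ...
    loss_ce = F.cross_entropy(logits, targets)

    # ... because it already applies LogSoftmax + NLLLoss internally
    loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

    print(torch.allclose(loss_ce, loss_manual))  # True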

The answer to the second part of your question is a little more complicated. There can be multiple causes for the drop in accuracy. Theoretically speaking, since the softmax layer you added can predict the correct answer with reasonable accuracy, the following layer should be able to do the same by preserving the maximum value, acting as an identity between the last two layers. Although the softmax normalizes those bounded outputs (between 0 and 1) again, it may change the way they are distributed, but it can still preserve the maximum, and therefore the class that is predicted.

However, in practice, things are a little bit different. When you have a double softmax in the output layer, you essentially change the output function in a way that changes the gradients propagated back through your network. Softmax with cross entropy is the preferred loss function precisely because of the gradients it produces. You can prove this to yourself by computing the gradients of the cost function and accounting for the fact that each "activation" (softmax output) is bounded between 0 and 1. The additional softmax "behind" the original one just multiplies the gradients by values between 0 and 1, shrinking them. This affects the updates to the weights. It might be possible to compensate by changing the learning rate, but that is strongly discouraged. Just have one softmax and you're done.
See Michael Nielsen's book, chapter 3, for a more in-depth explanation.
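A small sketch of that effect (shapes and values are illustrative): feeding already-softmaxed outputs into cross entropy typically yields much smaller gradients than feeding the raw logits:

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    targets = torch.tensor([3, 7])

    # Intended usage: raw logits go straight into cross entropy
    logits = torch.randn(2, 10, requires_grad=True)
    F.cross_entropy(logits, targets).backward()
    print(logits.grad.abs().mean())   # baseline gradient magnitude

    # Double softmax: cross_entropy applies softmax again on top of this one
    logits2 = logits.detach().clone().requires_grad_(True)
    F.cross_entropy(F.softmax(logits2, dim=1), targets).backward()
    print(logits2.grad.abs().mean())  # typically much smaller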

