Why use softmax only in the output layer and not in hidden layers?


Problem description

Most examples of neural networks for classification tasks I've seen use a softmax layer as the output activation function, while the other hidden units typically use a sigmoid, tanh, or ReLU activation function. As far as I know, using the softmax function as a hidden activation would work out mathematically too (a sketch of the typical setup follows the questions below).

  • What is the theoretical justification for not using the softmax function as a hidden-layer activation?
  • Are there any publications about this, something to cite?
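
For concreteness, here is a minimal sketch of the setup described above, assuming PyTorch; the layer sizes and batch size are arbitrary placeholders, not part of the original question.

```python
import torch
import torch.nn as nn

# Typical pattern from the question: ReLU (or sigmoid/tanh) in the hidden
# layers, softmax only on the output layer to produce class probabilities.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),               # hidden activation
    nn.Linear(256, 128),
    nn.ReLU(),               # hidden activation
    nn.Linear(128, 10),      # raw class scores (logits)
    nn.Softmax(dim=1),       # softmax only at the output
)

x = torch.randn(32, 784)     # dummy batch of 32 flattened inputs
probs = model(x)
print(probs.sum(dim=1))      # each row of probabilities sums to 1
```

Note that in practice the explicit `nn.Softmax` layer is often dropped during training, because losses such as `nn.CrossEntropyLoss` expect raw logits and apply log-softmax internally.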

Solution

I haven't found any publications about why using softmax as an activation in a hidden layer is not the best idea (except for the Quora question, which you have probably already read), but I will try to explain why using it this way is not the best idea:

1. Variable independence: a lot of regularization and effort goes into keeping your variables independent, uncorrelated, and quite sparse. If you use a softmax layer as a hidden layer, its outputs are forced to sum to 1, so all of your nodes (hidden variables) will be linearly dependent, which may result in many problems and poor generalization (a small numerical demonstration follows this list).

2. Training issues: try to imagine that, to make your network work better, you have to make some of the activations of your hidden layer a little lower. Then, because of the sum-to-1 constraint, you are automatically pushing the mean activation of the remaining units to a higher level, which may in fact increase the error and harm your training phase.

3. Mathematical issues: by placing constraints on the activations of your model, you decrease its expressive power without any logical justification. In my opinion, striving to constrain all the activations this way is not worth it.

4. Batch normalization does it better: one may argue that a constant mean output from a layer could be useful for training. But a technique called Batch Normalization has already been shown to work better for that purpose, whereas it has been reported that setting softmax as the activation function in a hidden layer may decrease the accuracy and the speed of learning (a sketch of this alternative also follows this list).
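
To make points 1 and 2 above concrete, here is a small NumPy demonstration with made-up numbers: because softmax outputs always sum to 1, the hidden units would be linearly dependent, and lowering one unit's pre-activation necessarily raises the others' share.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])   # pre-activations of a hypothetical hidden layer
a = softmax(logits)
print(a.sum())        # always 1.0 -> the activations are linearly dependent (point 1)

# Point 2: lowering a single pre-activation raises every other activation.
logits[0] -= 1.0
b = softmax(logits)
print(b - a)          # first entry decreases, all remaining entries increase
```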
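
For point 4, the usual way to control the scale of hidden activations is Batch Normalization rather than a hidden softmax. A minimal sketch, again assuming PyTorch and arbitrary layer sizes:

```python
import torch.nn as nn

# Normalize the hidden activations explicitly and keep an unconstrained
# non-linearity, instead of coupling the units through a hidden softmax.
model_bn = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # zero-mean / unit-variance hidden activations
    nn.ReLU(),
    nn.Linear(256, 10),    # logits; softmax (or CrossEntropyLoss) only at the output
)
```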

