Why use softmax only in the output layer and not in hidden layers?

Problem Description

Most examples of neural networks for classification tasks I've seen use a softmax layer as the output activation function. Normally, the other hidden units use a sigmoid, tanh, or ReLU function as the activation function. As far as I know, using the softmax function in the hidden layers would also work out mathematically.

  • What are the theoretical justifications for not using the softmax function as a hidden-layer activation function?
  • Are there any publications about this, something to quote?
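
For concreteness, the setup described in the question typically looks like the sketch below (a minimal PyTorch-style example; the layer sizes are illustrative assumptions, not taken from the question):

```python
import torch.nn as nn

# Common pattern: ReLU (or sigmoid/tanh) in the hidden layers,
# softmax only at the output to produce class probabilities.
standard_net = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
    nn.Softmax(dim=1),   # output layer: probabilities over 10 classes
)

# The variant the question asks about: softmax also used inside the hidden layers.
softmax_hidden_net = nn.Sequential(
    nn.Linear(784, 256),
    nn.Softmax(dim=1),   # hidden activations now sum to 1 for every example
    nn.Linear(256, 128),
    nn.Softmax(dim=1),
    nn.Linear(128, 10),
    nn.Softmax(dim=1),
)
```

(In practice, frameworks such as PyTorch usually fold the output softmax into the loss, e.g. nn.CrossEntropyLoss applied to raw logits, but the structure above matches the description in the question.)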

Solution

I haven't found any publications about why using softmax as an activation in a hidden layer is not the best idea (except for the Quora question you have probably already read), but I will try to explain why it is not the best idea to use it in this case:

1. Variable independence: a lot of regularization and effort goes into keeping your variables independent, uncorrelated, and quite sparse. If you use a softmax layer as a hidden layer, you keep all your nodes (hidden variables) linearly dependent, since their activations always sum to 1, which may result in many problems and poor generalization.
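
The dependence is easy to see numerically (a small NumPy/SciPy sketch with arbitrary numbers):

```python
import numpy as np
from scipy.special import softmax

# Pretend these are the pre-activations of four hidden units for one example.
h = softmax(np.array([0.3, -1.2, 2.5, 0.0]))

print(h.sum())                    # 1.0 -- every example is forced onto this constraint
print(h[-1], 1.0 - h[:-1].sum())  # the last unit is fully determined by the other three
```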

2. Training issues: try to imagine that, to make your network work better, you have to make part of the activations in your hidden layer a little lower. Then, because the outputs must still sum to 1, you automatically push the mean activation of the remaining units higher, which might in fact increase the error and harm your training phase.
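
This coupling can be demonstrated in the same way (again an illustrative NumPy/SciPy sketch):

```python
import numpy as np
from scipy.special import softmax

z = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical hidden pre-activations
z_lower = z.copy()
z_lower[3] -= 2.0                    # push one unit's pre-activation down

print(softmax(z))        # original hidden activations
print(softmax(z_lower))  # the other units' activations all rise,
                         # even though their own pre-activations did not change
```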

3. Mathematical issues: by creating constraints on the activations of your model, you decrease its expressive power without any logical justification. In my opinion, striving to have all activations constrained in this way is not worth it.
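
One concrete example of this loss of expressive power: softmax is invariant to adding a constant to all of its inputs, so the overall level of the hidden pre-activations can never be passed on to the next layer (a NumPy/SciPy sketch with arbitrary numbers):

```python
import numpy as np
from scipy.special import softmax

z = np.array([1.0, 2.0, 3.0, 4.0])

print(softmax(z))                                   # e.g. [0.032 0.087 0.237 0.644]
print(softmax(z + 100.0))                           # identical: the shift is invisible downstream
print(np.allclose(softmax(z), softmax(z + 100.0)))  # True
```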

4. Batch normalization does it better: one may consider the fact that a constant mean output from a layer can be useful for training. On the other hand, a technique called Batch Normalization has already been shown to work better for this purpose, whereas it has been reported that using softmax as the activation function in a hidden layer may decrease accuracy and slow down learning.
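
A hidden block built around Batch Normalization, as suggested above, would look roughly like this (a PyTorch-style sketch with illustrative sizes; it shows the commonly used Linear -> BatchNorm -> ReLU pattern, not code from any specific publication):

```python
import torch.nn as nn

# Normalize hidden activations with BatchNorm and keep an unconstrained
# nonlinearity, instead of forcing the activations through a softmax.
hidden_block = nn.Sequential(
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),  # normalizes each unit across the batch
    nn.ReLU(),            # units are not tied together by a sum-to-1 constraint
)
```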
