What is a derivative of the activation function used for in backpropagation?


Problem description

I am reading this document, and they stated that the weight adjustment formula is this:

new weight = old weight + learning rate * delta * df(e)/de * input

The df(e)/de part is the derivative of the activation function, which is usually a sigmoid function like tanh. Now, what is this actually for? Why are we even multiplying by that? Why isn't learning rate * delta * input alone enough?
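For concreteness, here is a minimal sketch of that update rule for a single weight, assuming a logistic sigmoid activation (the variable names lr, delta, and x are illustrative, not from the document):

    import math

    def sigmoid(e):
        return 1.0 / (1.0 + math.exp(-e))

    def sigmoid_deriv(e):
        # df(e)/de for the logistic sigmoid
        s = sigmoid(e)
        return s * (1.0 - s)

    # new weight = old weight + learning rate * delta * df(e)/de * input
    old_weight = 0.5
    x = 1.2                 # the input feeding this weight
    e = old_weight * x      # weighted sum arriving at the neuron
    delta = 0.3             # error signal reaching this neuron
    lr = 0.1                # learning rate

    new_weight = old_weight + lr * delta * sigmoid_deriv(e) * x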

This question came after this one and is closely related to it: Why must a nonlinear activation function be used in a backpropagation neural network?

Recommended answer

Training a neural network just refers to finding values for every cell in the weight matrices (of which there are two for an NN having one hidden layer) such that the squared differences between the observed and predicted data are minimized. In practice, the individual weights comprising the two weight matrices are adjusted with each iteration (their initial values are often set to random values). This is also called the online model, as opposed to the batch one, where weights are adjusted only after all the training patterns have been presented.
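As a rough sketch of that distinction, here is a toy one-weight model with a squared-error gradient (the data and helper names are made up for illustration):

    # Toy data: fit y = w * x by least squares; dE/dw = 2*(w*x - y)*x
    patterns = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
    lr = 0.01

    def grad(w, x, y):
        return 2.0 * (w * x - y) * x

    # Online model: the weight moves after every single pattern
    w = 0.0
    for x, y in patterns:
        w -= lr * grad(w, x, y)

    # Batch model: gradients are accumulated, the weight moves once per pass
    w_batch = 0.0
    g = sum(grad(w_batch, x, y) for x, y in patterns)
    w_batch -= lr * g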

But how should the weights be adjusted--i.e., in which direction (+/-)? And by how much?

That's where the derivative comes in. A large value for the derivative will result in a large adjustment to the corresponding weight. This makes sense, because if the derivative is large, that means you are far from a minimum. Put another way, weights are adjusted at each iteration in the direction of steepest descent (the largest value of the derivative) on the cost function's surface, as defined by the total error (observed versus predicted).

After the error on each pattern is computed (by subtracting the actual value of the response variable, or output vector, from the value predicted by the NN during that iteration), each weight in the weight matrices is adjusted in proportion to the calculated error gradient.

Because the error calculation begins at the end of the NN (i.e., at the output layer, by subtracting observed from predicted) and proceeds toward the front, it is called backprop.
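Here is a compact sketch of one such backward pass for a net with a single hidden layer (the shapes and variable names are my own, not from the answer); note that the deltas are computed at the output layer first and only then pushed back to the hidden layer:

    import numpy as np

    def sigmoid(e):
        return 1.0 / (1.0 + np.exp(-e))

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)          # one input pattern
    t = np.array([1.0])             # observed (target) output
    W1 = rng.normal(size=(4, 3))    # input -> hidden weight matrix
    W2 = rng.normal(size=(1, 4))    # hidden -> output weight matrix
    lr = 0.5

    # Forward pass
    h = sigmoid(W1 @ x)             # hidden activations
    y = sigmoid(W2 @ h)             # predicted output

    # Backward pass: the error signal starts at the output layer...
    delta2 = (y - t) * y * (1 - y)            # includes df(e)/de at the output
    # ...and is then propagated back to the hidden layer
    delta1 = (W2.T @ delta2) * h * (1 - h)

    # Each weight is adjusted in proportion to its error gradient
    W2 -= lr * np.outer(delta2, h)
    W1 -= lr * np.outer(delta1, x)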

More generally, the derivative (or gradient, for multivariable problems) is used by the optimization technique (for backprop, conjugate gradient is probably the most common) to locate minima of the objective (aka loss) function.

Here's how it works:

At a minimum of a curve, the line tangent to it has a slope of 0--in other words, the first derivative is 0 at that point.

So if you are walking around a 3D surface defined by the objective function and you reach a point where the slope = 0, then you are at the bottom--you have found a minimum (whether global or local) for the function.

But the first derivative tells you more than that. It also tells you whether you are going in the right direction to reach the function's minimum.

It's easy to see why this is so if you think about what happens to the slope of the tangent line as the point on the curve/surface moves down toward the function minimum.

The slope (and hence the value of the function's derivative at that point) gradually decreases. In other words, to minimize a function, follow the derivative--i.e., if its value is decreasing, you are moving in the correct direction.
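To see this numerically, here is a tiny sketch using a made-up quadratic loss (not from the answer); the printed slope shrinks toward 0 as the point walks down to the minimum:

    # Loss E(w) = (w - 3)^2 has its minimum at w = 3; dE/dw = 2*(w - 3)
    w = 0.0
    lr = 0.1
    for step in range(10):
        slope = 2.0 * (w - 3.0)
        w -= lr * slope              # follow the derivative downhill
        print(f"step {step}: w={w:.3f}, slope={slope:.3f}")
    # The magnitude of the slope keeps decreasing: we are moving in the
    # correct direction and closing in on the minimum.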
