Trying to understand code that computes the gradient w.r.t. the input for LogSoftMax in Torch
Question
The code comes from: https://github.com/torch/nn/blob/master/lib/THNN/generic/LogSoftMax.c
I don't see how this code is computing the gradient w.r.t. the input for the module LogSoftMax. What I'm confused about is what the two for loops are doing.
for (t = 0; t < nframe; t++)
{
  sum = 0;
  /* Point at row t of each (nframe x dim) tensor. */
  gradInput_data = gradInput_data0 + dim*t;
  output_data = output_data0 + dim*t;
  gradOutput_data = gradOutput_data0 + dim*t;

  /* First loop: accumulate the sum of gradOutput over the row. */
  for (d = 0; d < dim; d++)
    sum += gradOutput_data[d];

  /* Second loop: gradInput[d] = gradOutput[d] - exp(output[d]) * sum. */
  for (d = 0; d < dim; d++)
    gradInput_data[d] = gradOutput_data[d] - exp(output_data[d])*sum;
}
Answer
At forward time we have (with x = input vector, y = output vector, f = logsoftmax, i = i-th component):
yi = f(xi)
= log( exp(xi) / sum_j(exp(xj)) )
= xi - log( sum_j(exp(xj)) )
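In code, the forward pass is just this last formula. Here is a minimal C sketch (the function name logsoftmax_forward is only illustrative; the max is subtracted before exponentiating, a standard trick to avoid overflow):

#include <math.h>

/* y[i] = x[i] - log( sum_j exp(x[j]) ), computed stably. */
static void logsoftmax_forward(const double *x, double *y, int dim)
{
  double max = x[0];
  for (int i = 1; i < dim; i++)
    if (x[i] > max) max = x[i];

  double sum = 0;
  for (int i = 0; i < dim; i++)
    sum += exp(x[i] - max);

  double logsum = max + log(sum);
  for (int i = 0; i < dim; i++)
    y[i] = x[i] - logsum;
}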
When computing the Jacobian Jf of f you have (i-th row):
dyi/dxi = 1 - exp(xi) / sum_j(exp(xj))
and for k different from i:
dyi/dxk = - exp(xk) / sum_j(exp(xj))
This gives for Jf:
1-E(x1) -E(x2) -E(x3) ...
-E(x1) 1-E(x2) -E(x3) ...
-E(x1) -E(x2) 1-E(x3) ...
...
with E(xi) = exp(xi) / sum_j(exp(xj))
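That matrix translates almost literally into code. A sketch (illustrative names again; J is stored row-major as a dim*dim array):

#include <math.h>

/* J[i][k] = (i == k ? 1 : 0) - E(x[k]),
   where E(x[k]) = exp(x[k]) / sum_j exp(x[j]) is the softmax. */
static void logsoftmax_jacobian(const double *x, double *J, int dim)
{
  double sum = 0;
  for (int k = 0; k < dim; k++)
    sum += exp(x[k]);

  for (int i = 0; i < dim; i++)
    for (int k = 0; k < dim; k++)
      J[i*dim + k] = (i == k ? 1.0 : 0.0) - exp(x[k]) / sum;
}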
If we name gradInput the gradient w.r.t. the input and gradOutput the gradient w.r.t. the output, backpropagation gives (chain rule):
gradInputi = sum_j( gradOutputj . dyj/dxi )
which is equivalent to:
gradInput = transpose(Jf) . gradOutput
which finally gives for the i-th component:
gradInputi = gradOutputi - E(xi) . sum_j( gradOutputj )
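As a sketch, this per-component formula can be implemented directly from the input, with the E(xi) factor made explicit (unlike the Torch code, which reuses the stored output; see the update below):

#include <math.h>

/* gradInput[i] = gradOutput[i] - E(x[i]) * sum_j gradOutput[j] */
static void logsoftmax_backward(const double *x, const double *gradOutput,
                                double *gradInput, int dim)
{
  double expSum = 0, gradSum = 0;
  for (int j = 0; j < dim; j++) {
    expSum  += exp(x[j]);
    gradSum += gradOutput[j];
  }
  for (int i = 0; i < dim; i++)
    gradInput[i] = gradOutput[i] - (exp(x[i]) / expSum) * gradSum;
}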
So the first loop pre-computes sum_j( gradOutputj ), and the second one computes the term above, i.e. the i-th component of the gradient w.r.t. the input, except that a 1 / sum_j(exp(xj)) factor on the exponential term seems to be missing from the Torch implementation (the above calculus should probably be double-checked, even though it sounds correct and explains the current implementation).
UPDATE: there is no problem with the missing 1 / sum_j(exp(xj)) term. Since the Jacobian is computed on the output value, and since this previously computed output is precisely a log-softmax distribution, it follows that the sum-exp of this distribution is 1:
sum_j(exp(outputj)) = sum_j( exp( log( exp(inputj) / sum_k(exp(inputk)) ) ) )
                    = sum_j( exp(inputj) / sum_k(exp(inputk)) )
                    = 1
So there is no need to make this term explicit in the implementation, which gives (for x = output):
gradInputi = gradOutputi - exp(outputi) . sum_j( gradOutputj )
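To convince yourself, here is a small standalone C program (illustrative, not the Torch code) that runs the forward pass, checks that sum_j(exp(outputj)) is indeed 1, applies the output-based backward rule above, and compares it against a finite-difference estimate of the gradient of sum_j( gradOutputj . yj ):

#include <math.h>
#include <stdio.h>

#define DIM 4

static void forward(const double *x, double *y)
{
  double sum = 0;
  for (int j = 0; j < DIM; j++) sum += exp(x[j]);
  for (int i = 0; i < DIM; i++) y[i] = x[i] - log(sum);
}

int main(void)
{
  double x[DIM] = {0.5, -1.0, 2.0, 0.0};
  double gradOutput[DIM] = {1.0, -0.5, 0.25, 2.0};
  double y[DIM], gradInput[DIM];

  forward(x, y);

  /* The output is a log-softmax distribution, so its sum-exp is 1. */
  double s = 0;
  for (int j = 0; j < DIM; j++) s += exp(y[j]);
  printf("sum_j exp(output_j) = %.6f (should be 1)\n", s);

  /* Backward rule exactly as in the Torch loops: uses exp(output). */
  double sum = 0;
  for (int j = 0; j < DIM; j++) sum += gradOutput[j];
  for (int i = 0; i < DIM; i++)
    gradInput[i] = gradOutput[i] - exp(y[i]) * sum;

  /* Finite-difference check of d( sum_j gradOutput_j * y_j ) / dx_i. */
  const double eps = 1e-6;
  for (int i = 0; i < DIM; i++) {
    double xp[DIM], yp[DIM], ym[DIM];
    for (int j = 0; j < DIM; j++) xp[j] = x[j];
    xp[i] = x[i] + eps; forward(xp, yp);
    xp[i] = x[i] - eps; forward(xp, ym);
    double fd = 0;
    for (int j = 0; j < DIM; j++)
      fd += gradOutput[j] * (yp[j] - ym[j]) / (2 * eps);
    printf("i=%d  analytic=%+.6f  finite-diff=%+.6f\n", i, gradInput[i], fd);
  }
  return 0;
}

Compiled with e.g. gcc check.c -lm, the two columns should agree to several decimal places, which confirms that the output-based rule really is the gradient w.r.t. the input.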