Scale the loss value according to "badness" in Caffe


Problem description

I want to scale the loss value of each image based on how close or far the "current prediction" is to the "correct label" during training. For example, if the correct label is "cat" and the network thinks it is "dog", the penalty (loss) should be smaller than if the network thinks it is a "car".

My approach is as follows:

1- I defined a matrix of distances between the labels,
2- passed that matrix as a bottom to the "SoftmaxWithLoss" layer,
3- multiplied each log(prob) by this value in forward_cpu, to scale the loss according to badness (see the sketch after this list).
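
For concreteness, here is a minimal, self-contained sketch of what step 3 computes, written outside of Caffe; the function name, the badness argument, and the assumption of a single spatial location per sample are illustrative choices, not Caffe's actual Forward_cpu:

#include <algorithm>
#include <cfloat>
#include <cmath>
#include <vector>

// Sketch only, not Caffe's Forward_cpu: a "badness"-weighted log loss.
// prob    : softmax probabilities, num_samples x num_classes, row-major
// labels  : ground-truth class index per sample
// badness : num_classes x num_classes matrix; badness[l * num_classes + j]
//           weights the -log(prob_j) term when the true label is l
float weighted_log_loss(const std::vector<float>& prob,
                        const std::vector<int>& labels,
                        const std::vector<float>& badness,
                        int num_classes) {
  float loss = 0.f;
  const int num_samples = static_cast<int>(labels.size());
  for (int i = 0; i < num_samples; ++i) {
    const int l = labels[i];
    for (int j = 0; j < num_classes; ++j) {
      // Plain softmax loss keeps only the j == l term; here every class
      // contributes, weighted by how "bad" predicting j is when the truth is l.
      loss -= badness[l * num_classes + j] *
              std::log(std::max(prob[i * num_classes + j], FLT_MIN));
    }
  }
  return loss / num_samples;
}

With badness set to the identity matrix this reduces to the usual averaged softmax log loss.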

However, I do not know what I should do in the backward_cpu part. I understand the gradient (bottom_diff) has to be changed, but I am not quite sure how to incorporate the scale value here. According to the math I have to scale the gradient by this factor (because it is just a scale), but I don't know how.

Also, it seems like there is a loss layer in Caffe called "InfogainLoss" that does a very similar job, if I am not mistaken; however, the backward part of this layer is a little confusing:

bottom_diff[i * dim + j] = scale * infogain_mat[label * dim + j] / prob;

I am not sure why infogain_mat[] is divided by prob rather than multiplied by it! If I use the identity matrix for infogain_mat, isn't it supposed to act like the softmax loss in both forward and backward?

It would be highly appreciated if someone could give me some pointers.

Recommended answer

You are correct in observing that the scaling you are doing for the log(prob) is exactly what the "InfogainLoss" layer is doing (you can read more about it here and here).

As for the derivative (back-prop): the loss computed by this layer is

L = - sum_j infogain_mat[label * dim + j] * log( prob(j) )

If you differentiate this expression with respect to prob(j) (which is the input variable to this layer), you'll notice that the derivative of log(x) is 1/x; this is why you see that

dL/dprob(j) = - infogain_mat[label * dim + j] / prob(j) 
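
Term by term, this is just the chain rule applied to the logarithm:

d/dprob(j) [ - infogain_mat[label * dim + j] * log( prob(j) ) ] = - infogain_mat[label * dim + j] / prob(j)

and, up to the normalization constant scale, that is exactly the bottom_diff expression from backward_cpu quoted above: the division by prob comes from differentiating the logarithm, not from the forward loss itself.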


Now, why don't you see a similar expression in the back-prop of the "SoftmaxWithLoss" layer?
Well, as the name of that layer suggests, it is actually a combination of two layers: a softmax that computes class probabilities from the classifier's outputs, and a log-loss layer on top of it. Combining these two layers enables a more numerically robust estimation of the gradients.
Working a little with the "InfogainLoss" layer, I noticed that sometimes prob(j) can have a very small value, leading to unstable estimation of the gradients.

Here's a detailed computation of the forward and backward passes of "SoftmaxWithLoss" and "InfogainLoss" layers with respect to the raw predictions (x), rather than the "softmax" probabilities derived from these predictions using a softmax layer. You can use these equations to create a "SoftmaxWithInfogainLoss" layer that is more numerically robust than computing infogain loss on top of a softmax layer:
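
As a sketch of the key backward result (you can verify it by substituting prob(j) = softmax(x)(j) into the loss L above and differentiating with respect to x(i)):

dL/dx(i) = prob(i) * sum_j infogain_mat[label * dim + j] - infogain_mat[label * dim + i]

With the identity matrix for infogain_mat the sum over j equals 1, and this reduces to the familiar prob(i) - 1{i == label} gradient of "SoftmaxWithLoss"; in both cases no division by prob appears, which is what makes the combined formulation numerically robust.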

PS,
Note that if you are going to use the infogain loss for weighting, you should feed H (the infogain_mat) with label similarities, rather than distances.
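
For example, one simple way to turn a label-distance matrix into such a similarity matrix (a purely illustrative helper, not part of the original answer, assuming the distances are already scaled to [0, 1]):

#include <vector>

// Hypothetical helper: convert a [0, 1]-scaled label-distance matrix into a
// similarity matrix H for InfogainLoss. H[l * n + j] is large when predicting
// j for true label l is almost right, and small when it is a bad mistake.
std::vector<float> distance_to_infogain(const std::vector<float>& dist,
                                        int num_classes) {
  std::vector<float> H(dist.size());
  for (int l = 0; l < num_classes; ++l)
    for (int j = 0; j < num_classes; ++j)
      H[l * num_classes + j] = 1.f - dist[l * num_classes + j];
  return H;
}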

Update:
I recently implemented this robust gradient computation and created this pull request. This PR was merged into the master branch in April 2017.
