Scale the loss value according to "badness" in Caffe

Problem description

I want to scale the loss value of each image based on how close or far the "current prediction" is from the "correct label" during training. For example, if the correct label is "cat" and the network thinks it is "dog", the penalty (loss) should be smaller than if the network thinks it is "car".

My current approach is as follows:

1- I defined a matrix of the distances between the labels,
2- passed that matrix as a bottom to the "SoftmaxWithLoss" layer,
3- multiplied each log(prob) by this value in forward_cpu to scale the loss according to badness.

However, I do not know what I should do in the backward_cpu part. I understand that the gradient (bottom_diff) has to be changed, but I am not sure how to incorporate the scale value there. According to the math I have to scale the gradient by the same factor (because it is just a scale), but I don't know how.

Also, it seems there is a loss layer in Caffe called "InfogainLoss" that does a very similar job, if I am not mistaken. However, the backward part of this layer is a little confusing:

bottom_diff[i * dim + j] = scale * infogain_mat[label * dim + j] / prob;

I am not sure why infogain_mat[] is divided by prob rather than multiplied by it! If I use the identity matrix for infogain_mat, isn't it supposed to act like softmax loss in both forward and backward?

It will be highly appreciated if someone can give me some pointers.

Solution

You are correct in observing that the scaling you are doing for the log(prob) is exactly what the "InfogainLoss" layer does (you can read more about it here and here).

As for the derivative (back-prop): the loss computed by this layer is

L = - sum_j infogain_mat[label * dim + j] * log( prob(j) )

If you differentiate this expression with respect to prob(j) (which is the input variable to this layer), recall that the derivative of log(x) is 1/x; this is why you see that

dL/dprob(j) = - infogain_mat[label * dim + j] / prob(j) 
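
To make the two formulas above concrete, here is a minimal standalone C++ sketch of the per-sample loss and its derivative with respect to prob. It is not Caffe's actual layer code: the function names infogain_loss and infogain_grad, the row-major std::vector layout, and the clamping constant kMinProb are assumptions made for illustration, and the batch normalization and top_diff scaling that a real layer applies are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical lower bound on probabilities, used to avoid log(0) and
// division by zero. The value is an assumption for this sketch.
const double kMinProb = 1e-20;

// Per-sample infogain loss: L = -sum_j H[label * dim + j] * log(prob[j]).
// H is a row-major dim x dim matrix; prob holds dim class probabilities.
double infogain_loss(const std::vector<double>& H,
                     const std::vector<double>& prob,
                     std::size_t label) {
  const std::size_t dim = prob.size();
  assert(H.size() == dim * dim && label < dim);
  double loss = 0.0;
  for (std::size_t j = 0; j < dim; ++j) {
    loss -= H[label * dim + j] * std::log(std::max(prob[j], kMinProb));
  }
  return loss;
}

// Derivative with respect to prob: dL/dprob(j) = -H[label * dim + j] / prob(j).
// The division by prob(j) comes from d/dx log(x) = 1/x.
std::vector<double> infogain_grad(const std::vector<double>& H,
                                  const std::vector<double>& prob,
                                  std::size_t label) {
  const std::size_t dim = prob.size();
  assert(H.size() == dim * dim && label < dim);
  std::vector<double> grad(dim);
  for (std::size_t j = 0; j < dim; ++j) {
    grad[j] = -H[label * dim + j] / std::max(prob[j], kMinProb);
  }
  return grad;
}
```

With the identity matrix as H, the loss reduces to -log(prob(label)) and the only non-zero derivative is -1/prob(label), which is the plain log loss on top of a softmax.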

Now, why don't you see a similar expression in the back-prop of the "SoftmaxWithLoss" layer?
Well, as the name of that layer suggests, it is actually a combination of two layers: a softmax that computes class probabilities from the classifier outputs, and a log loss layer on top of it. Combining these two layers enables a more numerically robust estimation of the gradients.
While working with the "InfogainLoss" layer I noticed that prob(j) can sometimes have a very small value, which leads to unstable estimation of the gradients.

Here's a detailed computation of the forward and backward passes of "SoftmaxWithLoss" and "InfogainLoss" layers with respect to the raw predictions (x), rather than the "softmax" probabilities derived from these predictions using a softmax layer. You can use these equations to create a "SoftmaxWithInfogainLoss" layer that is more numerically robust than computing infogain loss on top of a softmax layer:
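
A sketch of that computation, written out from the definitions above; it is the standard chain-rule derivation and not necessarily the form the author originally presented. The notation is an assumption: l is the correct label, H_{l,j} stands for infogain_mat[label * dim + j], x_k are the raw predictions, and p_j is the softmax probability prob(j).

```latex
\begin{align*}
p_j &= \frac{e^{x_j}}{\sum_k e^{x_k}},
\qquad
L = -\sum_j H_{l,j} \log p_j \\
\frac{\partial \log p_j}{\partial x_k} &= \delta_{jk} - p_k \\
\frac{\partial L}{\partial x_k}
  &= -\sum_j H_{l,j}\,(\delta_{jk} - p_k)
   = p_k \sum_j H_{l,j} - H_{l,k}
\end{align*}
```

For H equal to the identity matrix the row sum is 1, and the gradient becomes p_k - \delta_{lk}, which is the familiar "SoftmaxWithLoss" gradient. No division by p_k appears, which is why the combined form stays well behaved when a probability is very small.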

PS,
Note that if you are going to use infogain loss for weighting, you should feed H (the infogain_mat) with label similarities, rather than distances.
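
As an illustration of feeding similarities, here is a small C++ sketch of an infogain loss computed directly from raw predictions x, with gradient dL/dx_k = prob(k) * sum_j H[label, j] - H[label, k]. The function names, the row-major std::vector layout, and the log-sum-exp formulation are assumptions made for this sketch; it is not the code of the Caffe layer or of any pull request.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax with the maximum subtracted first, so exp() cannot overflow.
std::vector<double> softmax(const std::vector<double>& x) {
  assert(!x.empty());
  const double x_max = *std::max_element(x.begin(), x.end());
  std::vector<double> p(x.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    p[i] = std::exp(x[i] - x_max);
    sum += p[i];
  }
  for (double& v : p) v /= sum;
  return p;
}

// L = -sum_j H[label * dim + j] * log p_j, using
// log p_j = x_j - x_max - log(sum_i exp(x_i - x_max)),
// so log() is never applied to a tiny probability.
double softmax_infogain_loss(const std::vector<double>& H,
                             const std::vector<double>& x,
                             std::size_t label) {
  const std::size_t dim = x.size();
  assert(dim > 0 && H.size() == dim * dim && label < dim);
  const double x_max = *std::max_element(x.begin(), x.end());
  double sum = 0.0;
  for (double v : x) sum += std::exp(v - x_max);
  const double log_sum = std::log(sum);
  double loss = 0.0;
  for (std::size_t j = 0; j < dim; ++j) {
    loss -= H[label * dim + j] * (x[j] - x_max - log_sum);
  }
  return loss;
}

// Gradient with respect to the raw predictions:
// dL/dx_k = p_k * sum_j H[label * dim + j] - H[label * dim + k].
std::vector<double> softmax_infogain_grad(const std::vector<double>& H,
                                          const std::vector<double>& x,
                                          std::size_t label) {
  const std::size_t dim = x.size();
  assert(dim > 0 && H.size() == dim * dim && label < dim);
  const std::vector<double> p = softmax(x);
  double row_sum = 0.0;
  for (std::size_t j = 0; j < dim; ++j) row_sum += H[label * dim + j];
  std::vector<double> grad(dim);
  for (std::size_t k = 0; k < dim; ++k) {
    grad[k] = p[k] * row_sum - H[label * dim + k];
  }
  return grad;
}
```

For example, with classes ordered cat, dog, car and a made-up similarity row for cat of {1.0, 0.5, 0.0}, raw predictions that favor dog give a smaller loss than predictions that favor car by the same margin, which is the behavior asked for in the question. A distance matrix would do the opposite, because a large entry in H rewards probability mass on that class.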

Update:
I recently implemented this robust gradient computation and created this pull request. The PR was merged into the master branch in April 2017.
