Tensorflow: Multiple loss functions vs Multiple training ops


Problem Description


I am creating a Tensorflow model which predicts multiple outputs (with different activations). I think there are two ways to do this:


Method 1: Create multiple loss functions (one for each output), merge them (using tf.reduce_mean or tf.reduce_sum), and pass the result to the training op like so:

final_loss = tf.reduce_mean(loss1 + loss2)
train_op = tf.train.AdamOptimizer().minimize(final_loss)


Method 2: Create multiple training operations and then group them like so:

train_op1 = tf.train.AdamOptimizer().minimize(loss1)
train_op2 = tf.train.AdamOptimizer().minimize(loss2)
final_train_op = tf.group(train_op1, train_op2)


My question is whether one method is advantageous over the other. Is there a third method I don't know about?

Thanks

Answer


I want to make a subtle point that I don't think was made in previous answers.


If you were using something like GradientDescentOptimizer, these would be very similar operations. That's because taking gradients is a linear operation, and the gradient of a sum is the same as the sum of the gradients.
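Concretely, for a single plain gradient-descent step with learning rate lr (and setting aside the fact that two grouped ops may each read slightly different variable values within one session.run), the update from the summed loss is

w - lr * grad(loss1 + loss2)  =  w - lr * grad(loss1) - lr * grad(loss2)

which is the same total change as applying the two separate minimize ops.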


But, ADAM does something special: regardless of the scale of your loss, it scales the gradients so that they're always on the order of your learning rate. If you multiplied your loss by 1000, it wouldn't affect ADAM, because the change would be normalized away.
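A rough sketch of why the scale cancels, using the usual form of the Adam update (and ignoring the small epsilon term):

step = lr * m_hat / (sqrt(v_hat) + eps)

If the loss is multiplied by a constant c, every gradient g becomes c * g, so the running mean m_hat becomes c * m_hat and the running root-mean-square sqrt(v_hat) becomes c * sqrt(v_hat). The factor c cancels, leaving the step size on the order of lr either way.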


So, if your two losses are roughly the same magnitude, then it shouldn't make a difference. If one is much larger than the other, then keep in mind that summing before the minimization will essentially ignore the small one, while making two ops will spend equal effort minimizing both.
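If you do want a single combined loss but need to keep the smaller term from being drowned out, one standard option (not part of the answer above, just a common trick) is to rescale the terms before summing. Here loss_weight is a hypothetical coefficient you would tune by hand:

# loss_weight is a made-up hyperparameter that rebalances the two terms.
final_loss = tf.reduce_mean(loss1 + loss_weight * loss2)
train_op = tf.train.AdamOptimizer().minimize(final_loss)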


I personally like dividing them up, which gives you more control over how much to focus on one loss or the other. For example, if it were multi-task learning, and one task were more important to get right than the other, two ops with different learning rates roughly accomplish this.
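A minimal sketch of that idea in the same TF1 style as the question; the specific learning rates are only illustrative, and which loss gets the larger one depends on which task matters more to you:

# Hypothetical learning rates; tune them to reflect task importance.
train_op1 = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss1)
train_op2 = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss2)
final_train_op = tf.group(train_op1, train_op2)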
