Keras training with batches: Is the training loss computed before or after each optimization step?

Question

This is probably a very basic question, but I wasn't able to find an answer to it: when I train a network with Keras using batches, the console output shows and keeps updating the current loss value on the training set during each training epoch. As I understand it, this loss value is computed over the current batch (as a proxy for the overall loss) and probably averaged with the loss values that were calculated for the previous batches. But there are two possibilities for getting the loss value of the current batch: either before updating the parameters or afterwards. Can someone tell me which of the two is correct? From what I observe, I would rather guess it is after the optimization step.
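
To make the displayed value concrete: the number Keras shows in the progress bar is a running average of the per-batch losses over the current epoch, and you can watch it with a callback. The following is only a minimal sketch, assuming TensorFlow 2.x / tf.keras; the model and data are placeholders.

import tensorflow as tf

class BatchLossLogger(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        # logs["loss"] is the running mean over all batches seen so far
        # in this epoch, not the raw loss of this single batch.
        print(f"batch {batch}: running mean loss = {logs['loss']:.6f}")

# Placeholder model and data, just to make the sketch runnable.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer="adam", loss="mse")
x = tf.random.normal((256, 8))
y = tf.random.normal((256, 1))
model.fit(x, y, batch_size=32, epochs=1, callbacks=[BatchLossLogger()], verbose=0)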

Reason why I ask this question: I was training a network and saw a behavior where the training loss (MSE of two embeddings) would decrease as expected (by several orders of magnitude), but the validation loss stayed the same. At first I thought it might be due to overfitting. Consequently, since the training dataset is quite big (200k images), I decided to decrease the epoch size to be able to see the validation set evaluated more often, resulting in epochs smaller than trainingSetSize/batchSize. Even then I saw the training loss decreasing from epoch to epoch (while the validation loss stayed the same), which I found quite intriguing, as the network was still in the phase where it saw the training data for the very first time. In my understanding this means that either there is some nasty bug in my setup or the displayed training loss is shown after taking an optimization step. Otherwise, the loss on a new, never-seen batch and the validation set should behave at least similarly.

Even if I assume that the loss is calculated after each optimization step: assuming my network makes no useful progress, as suggested by the validation-set evaluation, it should also behave arbitrarily when seeing a new, never-seen batch. Then the whole decrease in training loss would only be due to the optimization step (which would be very good for the batch at hand but not for other data, obviously, so also a kind of overfitting). This would mean, if the training loss keeps decreasing, that the optimization step per batch gets more effective. I am using the Adam optimizer, which I know is adaptive, but is it really possible to see a continuous and substantial decrease in training loss while in reality the network doesn't learn any useful generalization?
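
One way to probe this directly is to evaluate each batch before training on it, so the logged number always refers to data the network has not yet been updated with. A rough sketch, assuming a compiled tf.keras model named model and a hypothetical iterable batches of (x, y) pairs:

for step, (x_batch, y_batch) in enumerate(batches):
    # Loss on a batch the network has never been trained on yet:
    fresh_loss = model.test_on_batch(x_batch, y_batch)
    # Training step; the returned value comes from the forward pass of this
    # step, i.e. before the weight update, so it should be close to fresh_loss.
    train_loss = model.train_on_batch(x_batch, y_batch)
    print(step, fresh_loss, train_loss)
# If fresh_loss keeps decreasing while the validation loss stays flat, the
# problem likely lies elsewhere (e.g. in the validation data or preprocessing).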

Answer

The loss is computed before the optimization step. The reason for this is efficiency and has to do with how back-propagation works.

In particular, suppose we want to minimize ||A(x, z) - y||^2 w.r.t. z. Then when we perform back-propagation we need to evaluate this computational graph:

A(x, z) -> grad ||. - y||^2 -> backpropagate

Now, if we add an "evaluate loss" step to this and evaluate the loss before updating the parameters, the computational graph would look like this:

           >  grad ||. - y||^2 -> backpropagate
         /
A(x, z) 
         \
           >  ||. - y||^2

On the other hand, if we evaluate the loss after updating the parameters, the graph would look like this:

A(x, z) -> grad ||. - y||^2 -> backpropagate -> A(x, z) -> ||. - y||^2

Hence, if we evaluate the loss after updating, we need to compute A(x, z) twice, whereas if we compute it before updating we only need to compute it once. Computing it before updating is therefore twice as fast.
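
As an illustration, here is a sketch of a single training step written with TensorFlow's GradientTape. This is not the literal Keras internals, just the same pattern: the loss comes out of the same forward pass whose intermediate results back-propagation reuses, and that pre-update value is what gets reported; logging the post-update loss would require a second forward pass.

import tensorflow as tf

@tf.function
def train_step(model, optimizer, x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)               # forward pass: A(x, z)
        loss = tf.reduce_mean(tf.square(pred - y))   # ||A(x, z) - y||^2
    grads = tape.gradient(loss, model.trainable_variables)            # backpropagate
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # update z
    # Returning the loss from the forward pass costs nothing extra;
    # the post-update loss would require calling model(x) a second time.
    return loss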
