How to properly update the weights in PyTorch?

Problem description

I'm trying to implement gradient descent with PyTorch according to this schema but can't figure out how to properly update the weights. It is just a toy example with 2 linear layers, 2 nodes in the hidden layer, and one output.

Learning rate = 0.05;
target output = 1

https://hmkcode.github.io/ai/backpropagation-step-by-step/

[Forward pass and backward pass diagrams from the linked tutorial]

My code is as follows:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim

    class MyNet(nn.Module):

        def __init__(self):
            super(MyNet, self).__init__()
            # 2-in, 2-out hidden layer without bias; weights set to the tutorial's values
            self.linear1 = nn.Linear(2, 2, bias=None)
            self.linear1.weight = torch.nn.Parameter(torch.tensor([[0.11, 0.21], [0.12, 0.08]]))
            # 2-in, 1-out output layer without bias
            self.linear2 = nn.Linear(2, 1, bias=None)
            self.linear2.weight = torch.nn.Parameter(torch.tensor([[0.14, 0.15]]))

        def forward(self, inputs):
            out = self.linear1(inputs)
            out = self.linear2(out)
            return out

    losses = []
    loss_function = nn.L1Loss()
    model = MyNet()
    optimizer = optim.SGD(model.parameters(), lr=0.05)
    input = torch.tensor([2.0,3.0])
    print('weights before backpropagation = ',   list(model.parameters()))
    for epoch in range(1):
       result = model(input )
       loss = loss_function(result , torch.tensor([1.00],dtype=torch.float))
       print('result = ', result)
       print("loss = ",   loss)
       model.zero_grad()
       loss.backward()
       print('gradients =', [x.grad.data  for x in model.parameters()] )
       optimizer.step()
       print('weights after backpropagation = ',   list(model.parameters())) 

The output is as follows:

    weights before backpropagation =  [Parameter containing:
    tensor([[0.1100, 0.2100],
            [0.1200, 0.0800]], requires_grad=True), Parameter containing:
    tensor([[0.1400, 0.1500]], requires_grad=True)]

    result =  tensor([0.1910], grad_fn=<SqueezeBackward3>)
    loss =  tensor(0.8090, grad_fn=<L1LossBackward>)

    gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]), 
                 tensor([[-0.8500, -0.4800]])]

    weights after backpropagation =  [Parameter containing:
    tensor([[0.1240, 0.2310],
            [0.1350, 0.1025]], requires_grad=True), Parameter containing:
    tensor([[0.1825, 0.1740]], requires_grad=True)]

Forward pass values:

2*0.11 + 3*0.21 = 0.85
2*0.12 + 3*0.08 = 0.48
0.85*0.14 + 0.48*0.15 = 0.191 -> loss = 0.191 - 1 = -0.809
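
For reference, these numbers can be reproduced with plain tensor operations (a minimal sketch reusing the imports from the code above; the variable names are only illustrative):

    x = torch.tensor([2.0, 3.0])
    W1 = torch.tensor([[0.11, 0.21], [0.12, 0.08]])   # hidden layer weights
    W2 = torch.tensor([[0.14, 0.15]])                 # output layer weights

    h = W1 @ x            # hidden outputs: tensor([0.8500, 0.4800])
    y = W2 @ h            # prediction:     tensor([0.1910])
    print(y - 1.0)        # prediction - target = -0.809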

Backward pass: let's calculate w5 and w6 (output node weights)

w = w - (prediction - target) * (gradient) * (output of previous node) * (learning rate)
w5 = 0.14 - (0.191 - 1) * 1 * 0.85 * 0.05 = 0.14 + 0.034 = 0.174
w6 = 0.15 - (0.191 - 1) * 1 * 0.48 * 0.05 = 0.15 + 0.019 = 0.169

In my example Torch doesn't multiply the loss by the derivative, so we get wrong weights after updating. For the output node we got new weights w5, w6 = [0.1825, 0.1740], when they should be [0.174, 0.169].

Moving backward to update the first weight of the output node (w5), we need to calculate: (prediction - target) x (gradient) x (output of previous node) x (learning rate) = -0.809*1*0.85*0.05 = -0.034. Updated weight w5 = 0.14 - (-0.034) = 0.174. But instead PyTorch calculated the new weight as 0.1825. It forgot to multiply by (prediction - target) = -0.809. For the output node we got gradients -0.8500 and -0.4800, but we still need to multiply them by the loss 0.809 and the learning rate 0.05 before we can update the weights.

What is the proper way of doing this? Should we pass 'loss' as an argument to backward(), i.e. loss.backward(loss)?

That seems to fix it, but I couldn't find any example of this in the documentation.
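
Digging a little further: passing a tensor to backward() just seeds the backward pass with that value instead of the default 1.0, so loss.backward(loss) simply scales every gradient by the current loss value. A minimal sketch of what I mean (reusing the MyNet class from above):

    model = MyNet()
    out = model(torch.tensor([2.0, 3.0]))
    loss = nn.L1Loss()(out, torch.tensor([1.0]))
    model.zero_grad()
    loss.backward(gradient=torch.tensor(loss.item()))   # seeds backprop with 0.809 instead of 1.0
    print([p.grad for p in model.parameters()])          # the plain gradients scaled by ~0.809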

Recommended answer

You should use .zero_grad() with the optimizer, so optimizer.zero_grad(), not loss or model as suggested in the comments (model.zero_grad() works fine too, but it is not as clear or readable IMO).
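
For reference, a minimal sketch of the usual order of calls in one training step, reusing the names from the question's code:

    optimizer.zero_grad()                                # clear old gradients first
    result = model(input)                                # forward pass
    loss = loss_function(result, torch.tensor([1.0]))    # compute the loss against the target
    loss.backward()                                      # backprop: fills p.grad for every parameter
    optimizer.step()                                     # plain SGD update: p = p - lr * p.grad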

That aside, your parameters are updated just fine, so the error is not on PyTorch's side.

Based on the gradient values you provided:

gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]), 
             tensor([[-0.8500, -0.4800]])]

Let's multiply all of them by your learning rate (0.05):

gradients_times_lr = [tensor([[-0.014, -0.021], [-0.015, -0.0225]]), 
                      tensor([[-0.0425, -0.024]])]

Finally, let's apply ordinary SGD (theta -= gradient * lr), to get exactly the same results as in PyTorch:

parameters = [tensor([[0.1240, 0.2310], [0.1350, 0.1025]]),
              tensor([[0.1825, 0.1740]])]
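
A minimal sketch of this by-hand update (illustrative only; it assumes loss.backward() has already been called on the question's model and replaces optimizer.step()):

    lr = 0.05
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad          # ordinary SGD: theta = theta - lr * grad
    print(list(model.parameters()))   # same numbers as optimizer.step() produces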

What you have done is take the gradients calculated by PyTorch and multiply them by the output of the previous node, and that's not how it works!

What you did:

w5 = 0.14 - (0.191 - 1) * 1 * 0.85 * 0.05 = 0.14 + 0.034 = 0.174

What should be done (using PyTorch's results):

w5 = 0.14 - (-0.85*0.05) = 0.1825

No multiplication by the previous node's output is needed; it's done behind the scenes (that's what .backward() does - it calculates the correct gradients for all of the nodes), so there is no need to multiply them by the previous outputs yourself.

If you want to calculate them manually, you have to start at the loss (with delta being one) and backprop all the way down (do not use learning rate here, it's a different story!).
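
A minimal sketch of that manual chain-rule computation for this particular 2-2-1 linear network with L1 loss (no learning rate anywhere; the variable names are only illustrative):

    x = torch.tensor([2.0, 3.0])
    W1 = torch.tensor([[0.11, 0.21], [0.12, 0.08]])
    W2 = torch.tensor([[0.14, 0.15]])

    h = W1 @ x                                # hidden outputs: [0.85, 0.48]
    y = W2 @ h                                # prediction: [0.191]

    # start at the loss: d|target - y| / dy = sign(y - target) = -1 here
    dL_dy = torch.sign(y - 1.0)               # tensor([-1.])

    # output layer: dL/dW2 = dL/dy * h
    dL_dW2 = dL_dy[:, None] * h[None, :]      # [[-0.85, -0.48]]

    # hidden layer: dL/dh = W2^T @ dL/dy, then dL/dW1 = dL/dh * x
    dL_dh = W2.t() @ dL_dy                    # [-0.14, -0.15]
    dL_dW1 = dL_dh[:, None] * x[None, :]      # [[-0.28, -0.42], [-0.30, -0.45]]

    print(dL_dW2, dL_dW1)                     # matches the gradients PyTorch reported above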

After all of them are calculated, you multiply each gradient by the optimizer's learning rate (or apply any other update formula for that matter, e.g. momentum), and after this you have your correct update.

The learning rate is not part of backpropagation; leave it alone until you have calculated all of the gradients (mixing it in conflates two separate things: the optimization procedure and backpropagation).

Well, I don't know why you are using Mean Absolute Error (while in the tutorial it is Mean Squared Error), and that's why both those results vary. But let's go with your choice.

The derivative of |y_true - y_pred| w.r.t. y_pred is ±1 (here -1, since the prediction is below the target), so IT IS NOT the same as the loss value. Change to MSE to get results matching the tutorial (there the derivative is 2 * (y_pred - y_true); we usually put a factor of 1/2 in front of MSE so that the 2 cancels).
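
A minimal sketch of that difference (illustrative only; it reuses the MyNet class from the question and compares the output-layer gradients under both losses):

    for loss_fn in (nn.L1Loss(), nn.MSELoss()):
        model = MyNet()
        out = model(torch.tensor([2.0, 3.0]))
        loss = loss_fn(out, torch.tensor([1.0]))
        model.zero_grad()
        loss.backward()
        # L1:  dL/dy = -1          -> linear2 grad = [-0.85, -0.48]
        # MSE: dL/dy = 2*(y - 1)   -> linear2 grad = -1.618 * [0.85, 0.48]
        print(type(loss_fn).__name__, model.linear2.weight.grad)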

In the MSE case the gradient does get scaled by the (prediction - target) term, but that depends entirely on the loss function (it was a bit unfortunate that the tutorial you were using didn't point this out).

You could probably go on from here, but... the derivative of the output w.r.t. w5 is the output of h1 (0.85 in this case). We multiply it by the derivative of the total error w.r.t. the output (-1 here, as above) and obtain -0.85, exactly as PyTorch did. The same idea goes for w6.

I seriously advise you not to confuse learning rate with backprop, you are making your life harder (and it's not easy with backprop IMO, quite counterintuitive), and those are two separate things (can't stress that one enough).

This source is nice, more step-by-step, with a little more complicated network idea (activations included), so you can get a better grasp if you go through all of it.

Furthermore, if you are really keen (and you seem to be) to know more of the ins and outs of this, calculate the weight corrections for other optimizers (say, Nesterov momentum), so you can see why we should keep those ideas separate.
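
For example, here is a minimal, purely illustrative sketch that swaps in SGD with Nesterov momentum on the same toy model; after the first step the updates depend on the accumulated velocity buffer, not just on -lr * grad, which is exactly why the two ideas should stay separate:

    model = MyNet()
    optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9, nesterov=True)
    for step in range(3):
        optimizer.zero_grad()
        out = model(torch.tensor([2.0, 3.0]))
        loss = nn.L1Loss()(out, torch.tensor([1.0]))
        loss.backward()
        optimizer.step()
        print(step, model.linear2.weight.data)   # updates differ from plain -lr * grad after step 0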
