Why does higher need to deep copy the parameters of the base model to create a functional model?

Problem description

I saw this line of code in the higher library:

self.param_groups = _copy.deepcopy(other.param_groups)

and I don't understand why that's needed.

If anything, I think it's harmful, as I've outlined here. You can go to the issue to see my reasons, but the gist is this:

Wouldn't having that deep copy mean the (outer loop) optimizer would be computing gradients with respect to parameters that are not present in the computation graph? Since:

the parameters of the differentiable/inner optimizer are a deep copy of the initial parameters/weights, while the outer optimizer (e.g. Adam) holds the original/initial parameters, so the gradients of those should always be zero. That is the only explanation I can think of for the issues I've had in the past (gradients being zero unexpectedly), yet the higher MAML tutorial seems to work, which should contradict my theory. If my theory were right, then at the end of MAML's inner loop, when the outer optimizer (usually Adam) computes the gradients, they should be zero (which I have sometimes observed). But I assume they are NOT zero, otherwise that tutorial wouldn't work.
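For reference, the pattern I am talking about is essentially the one from higher's MAML example. Here is a minimal sketch of it (the toy model, data, and step counts are placeholders of mine) that probes whether the gradients on the original parameters really end up zero:

import torch
import torch.nn as nn
import torch.nn.functional as F
import higher

model = nn.Linear(4, 1)
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # outer/meta optimizer
inner_opt = torch.optim.SGD(model.parameters(), lr=1e-1)   # inner optimizer

x_spt, y_spt = torch.randn(8, 4), torch.randn(8, 1)   # toy support set
x_qry, y_qry = torch.randn(8, 4), torch.randn(8, 1)   # toy query set

meta_opt.zero_grad()
# copy_initial_weights=False: fmodel's initial weights ARE the original model's
# parameters, so the outer gradient has a path back to them.
with higher.innerloop_ctx(model, inner_opt, copy_initial_weights=False) as (fmodel, diffopt):
    for _ in range(5):                                  # inner adaptation steps
        diffopt.step(F.mse_loss(fmodel(x_spt), y_spt))  # differentiable update
    outer_loss = F.mse_loss(fmodel(x_qry), y_qry)
    outer_loss.backward()   # accumulates .grad on the original model.parameters()

# If the deepcopy inside the differentiable optimizer cut the graph, these
# sums would all be zero according to my theory.
print([p.grad.abs().sum().item() for p in model.parameters()])
meta_opt.step()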

So I am inquiring about the need to use a deep copy when creating inner optimizers. What is its purpose, and why does it not cause the issues I describe in higher's original MAML tutorial? How is it that the deep copy doesn't break the forward pass, and thus the whole computation of gradients with respect to the initialization that the outer optimizer would use?

I think at the core of my confusion is that I don't understand why we need to do the deepcopy in the first place. Without all the other code (which seems convoluted to me), we even risk that the initialization we might want to train with the outer optimizer never trains, since the outer/meta optimizer holds a pointer to the original model's params and not to the deep copy the inner optimizer could have had.

Why would the developers go through all of that, adding substantial code that seems to carry such high risks?

Related question on how the copying of the initial parameters happens in higher: What does the copy_initial_weights documentation mean in the higher library for Pytorch?

Recommended answer

Judging by the later code, the main reason for that line is to copy everything except the trainable weights. Unfortunately, that is difficult to achieve without copying the weights too, so a plain call to deepcopy is used.

If you trace how self.param_groups is used, you will find that the 'params' entry of each element is actually just replaced by None later, here.
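In other words, the deepcopy is just a convenient way to snapshot the structure of each group and its hyperparameters; the copied weight tensors themselves are immediately discarded. A rough illustration of the effect (this is not the actual higher code, only a sketch):

import copy
import torch

# Roughly what an optimizer's param_groups looks like: trainable weights
# plus hyperparameters such as lr and momentum (illustration only).
weights = [torch.randn(3, requires_grad=True)]
param_groups = [{'params': weights, 'lr': 0.1, 'momentum': 0.9}]

# Snapshot everything with deepcopy (this copies the weights too) ...
groups_copy = copy.deepcopy(param_groups)

# ... then throw the copied weights away, keeping only the rest.
for group in groups_copy:
    group['params'] = [None] * len(group['params'])

print(groups_copy)   # [{'params': [None], 'lr': 0.1, 'momentum': 0.9}]

So the expensive-looking copy of the weights is never actually kept around.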

The initialization of the differentiable optimizer here needs to make copies of all parameters that the reference optimizer other has (both tensor and non-tensor ones such as lr, and state such as momentum_buffer, though the state is copied later, here). This effectively creates a snapshot of all parameters of the other optimizer, except for the trainable weights that other was supposed to accumulate gradients into. So overall the gradients don't propagate through these copies; they propagate through the initial weights of fmodel (if copy_initial_weights=False for that model) and/or through tensors requiring gradients which were passed to the differentiable optimizer using override.
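Concretely, the second route can look something like the sketch below; the learnable learning rate is my own example, and I am assuming the override argument of higher.innerloop_ctx as described in its docs:

import torch
import torch.nn as nn
import torch.nn.functional as F
import higher

model = nn.Linear(4, 1)
inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)

# A learnable inner-loop learning rate: any tensor with requires_grad=True.
lr_tensor = torch.tensor(0.1, requires_grad=True)

x, y = torch.randn(8, 4), torch.randn(8, 1)   # toy data

with higher.innerloop_ctx(
    model,
    inner_opt,
    copy_initial_weights=False,
    override={'lr': [lr_tensor]},   # route 2: differentiable hyperparameter
) as (fmodel, diffopt):
    diffopt.step(F.mse_loss(fmodel(x), y))   # one differentiable inner step
    final_loss = F.mse_loss(fmodel(x), y)
    final_loss.backward()

# Gradients arrive at the original weights (route 1) and at the overridden
# learning rate (route 2), not at the deep-copied param_groups snapshot.
print(lr_tensor.grad)
print([p.grad is not None for p in model.parameters()])

Either way, nothing in the backward path goes through the deep-copied tensors sitting in the differentiable optimizer's param_groups, which is why the snapshot doesn't break the outer gradient.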
