Why does higher need to deep copy the parameters of the base model to create a functional model?


Problem description

I found this line of code in the higher library:

self.param_groups = _copy.deepcopy(other.param_groups)

and I don't understand why that's needed.

If anything, I think it's harmful, as I've outlined here. You can go to the issue to see my reasoning, but the gist is this:

Wouldn't having that deep copy mean the (outer loop) optimizer would be computing gradients with respect to parameters that are not present in the computation graph? Since:

- the parameters of the differentiable/inner optimizer are a deep copy of the initial parameters/weights, while
- the outer optimizer (e.g. Adam) holds the original/initial parameters,

the gradients of those original parameters should always be zero. That is the only explanation I can think of for my past issues (gradients being zero unexpectedly), yet the higher MAML tutorial seems to work, which goes against my theory. If my theory were right, then at the end of the inner loop of MAML, when the outer optimizer (usually Adam) computes the gradients, they should be zero (which I have observed sometimes). But I assume they are NOT zero, otherwise that tutorial wouldn't work.
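To make the worry concrete, here is a minimal sketch in plain PyTorch (not code from higher; w is a made-up stand-in for an initialization) showing that copy.deepcopy of a tensor produces a new leaf disconnected from the autograd graph, so a loss built only on the copy leaves the original with no gradient:

import copy
import torch

# Original "initialization" that an outer optimizer would hold a reference to.
w = torch.randn(3, requires_grad=True)

# Deep copy: a new leaf tensor with no autograd connection back to w.
w_copy = copy.deepcopy(w)

# Build a loss from the copy only and backpropagate.
loss = (w_copy * 2).sum()
loss.backward()

print(w_copy.grad)  # tensor([2., 2., 2.]) - the gradient lands on the copy
print(w.grad)       # None - nothing flows back to the original w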

So I am asking about the need to use deep copy when creating inner optimizers. What is its purpose, and why doesn't it cause the issues I describe above in higher's original MAML tutorial? How is it that the deep copy doesn't break the forward pass, and thus the whole computation of the gradient with respect to the initialization that the outer optimizer would use?

I think at the core of my confusion is that I don't understand why we need to do the deepcopy in the first place. Without all the other code (which seems convoluted to me), we even risk that the initialization we might want to train with the outer optimizer never trains, since the outer/meta optimizer holds a pointer to the params of the original model and not to the deep copy that the inner optimizer could have had.
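As a point of reference, this minimal sketch in plain PyTorch (the model and hyperparameters are arbitrary placeholders) shows what "the outer/meta optimizer has a pointer to the params of the original model" means: torch.optim optimizers store references to the model's own parameter tensors, so only gradients that land on those exact tensors can train the initialization.

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# The optimizer's param_groups hold the model's own tensors, not copies,
# so meta_opt.step() only moves parameters whose .grad was populated.
p_model = next(model.parameters())
p_opt = meta_opt.param_groups[0]['params'][0]
print(p_opt is p_model)  # True - same tensor object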

Why would the developers go through all of that, adding substantial code that seems to carry high risks?

Related question on how the copying of the initial parameters happens in higher: What does the copy_initial_weights documentation mean in the higher library for Pytorch?

Answer

The main reason for that line is to copy everything except the trainable weights, judging by the later code. Unfortunately, that is difficult to achieve without also copying the weights, so a plain call to deepcopy is used.

If you trace how self.param_groups is used, you will find that the 'params' entry of each element is actually just replaced by None later, here.
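A rough sketch of that idea in plain PyTorch (an approximation to illustrate the answer, not the library's actual code): deepcopy the param groups to snapshot the hyperparameters, then discard the copied weight tensors.

import copy
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
inner_opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# deepcopy snapshots everything in the param groups: lr, momentum, ... and,
# unavoidably, the weight tensors themselves.
param_groups = copy.deepcopy(inner_opt.param_groups)

# The copied weight tensors are then dropped, so only the per-group
# hyperparameters survive the copy.
for group in param_groups:
    group['params'] = None

# Prints the remaining hyperparameters of each group (lr, momentum, ...).
print([{k: v for k, v in g.items() if k != 'params'} for g in param_groups])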

The initialization of the differentiable optimizer here needs to make copies of all the parameters the reference optimizer other has (including tensor and non-tensor ones such as lr, and state such as momentum_buffer, though the state is copied later, here). This effectively creates a snapshot of all parameters of the other optimizer, except for the trainable weights into which other was supposed to accumulate gradients. So overall, gradients do not propagate through these copies; they propagate through the initial weights of fmodel (if copy_initial_weights=False for that model) and/or through tensors requiring gradient that were passed to the differentiable optimizer using override.
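For concreteness, here is a hedged sketch of the usual higher inner/outer loop, modeled on the MAML example (loss_fn, the support/query tensors, the network and all hyperparameters are placeholder assumptions). With copy_initial_weights=False the outer backward pass populates .grad on the original model.parameters(), which are exactly the tensors the outer optimizer holds:

import torch
import torch.nn as nn
import higher

model = nn.Linear(4, 2)
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # holds the original params
inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)

loss_fn = nn.MSELoss()
x_support, y_support = torch.randn(8, 4), torch.randn(8, 2)
x_query, y_query = torch.randn(8, 4), torch.randn(8, 2)

meta_opt.zero_grad()
with higher.innerloop_ctx(model, inner_opt,
                          copy_initial_weights=False) as (fmodel, diffopt):
    for _ in range(5):                       # inner loop: functional updates
        inner_loss = loss_fn(fmodel(x_support), y_support)
        diffopt.step(inner_loss)             # differentiable optimizer step
    outer_loss = loss_fn(fmodel(x_query), y_query)
    outer_loss.backward()                    # grads reach model.parameters()

print(next(model.parameters()).grad)  # populated (not None), so the outer
meta_opt.step()                       # step can actually train the initialization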
