What does the copy_initial_weights documentation mean in the higher library for Pytorch?


Problem description

I was trying to use the higher library for meta-learning and I was having issues understanding what copy_initial_weights means. The docs say:

copy_initial_weights – if true, the weights of the patched module are copied to form the initial weights of the patched module, and thus are not part of the gradient tape when unrolling the patched module. If this is set to False, the actual module weights will be the initial weights of the patched module. This is useful when doing MAML, for example.

but that doesn't make much sense to me, for the following reasons:

For example, "the weights of the patched module are copied to form the initial weights of the patched module" doesn't make sense to me, because when the context manager is initiated the patched module does not exist yet. So it is unclear what we are copying from and to where (and why copying is something we want to do).

Also, "unrolling the patched module" does not make sense to me. We usually unroll a computation graph caused by a for loop, while a patched module is just a neural net that has been modified by this library, so "unrolling" is ambiguous here.

Also, there isn't a technical definition of "gradient tape".

Also, when describing what False does, saying that it's useful for MAML isn't actually helpful, because it doesn't even hint at why it's useful for MAML.

Overall, it's impossible to use the context manager.

Any explanations and examples of what that flag does in more precise terms would be really valuable.


Answer

I think it's more or less clear to me now what this means.

First I'd like to make some notation clear, especially with respect to the indices for the inner time step and the outer time step (also known as episodes):

W^<inner_i, outer_i> denotes the value a tensor has at inner time step inner_i of outer time step outer_i.

At the beginning of training a neural net has the parameters:

W^<0,0>

which are held inside its module. For the sake of explanation, the specific tensor (for the base model) will be denoted:

W = the tensor holding the weights for the base model. This can be thought of as the initialization of the model.

and it will be updated with an in-place operation by the outer optimizer (this is important, since W is the placeholder for all the W^<0,outer_i>, for every outer-step value, during "normal" meta-learning). I want to emphasize that W is the tensor of the normal Pytorch neural net base model. By changing this in-place with an outer optimizer (like Adam) we are effectively training the initialization. The outer optimizer will use the gradients with respect to this tensor to do the update through the whole unrolled inner-loop process.
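
As a quick sanity check of the in-place part, here is a minimal sketch (plain Pytorch; the toy linear model, data and learning rate are my own choices, not anything prescribed by higher): the outer optimizer's step overwrites W's storage rather than creating a new tensor.

    import torch
    import torch.nn as nn

    # Toy base model; model.weight plays the role of W in the notation above.
    model = nn.Linear(4, 1)
    outer_opt = torch.optim.Adam(model.parameters(), lr=1e-2)

    W = model.weight
    storage_before = W.data_ptr()

    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    outer_opt.step()  # in-place: W^<0,outer_i> is overwritten with W^<0,outer_i+1>

    print(W.data_ptr() == storage_before)  # True: same tensor, new values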

When we say copy_initial_weights=False we mean that we will have a gradient path directly to W, with whatever value it currently has. Usually the context manager is entered before the inner loop, after an outer step has been done, so W will hold W^<0,outer_i> for the current step. In particular, the code that does this for copy_initial_weights=False is this one:

params = [ p.clone() if device is None else p.clone().to(device) for p in module.parameters() ]
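
To see what that buys you, here is a hedged MAML-style sketch with copy_initial_weights=False (the toy model, data, losses and learning rates are my own assumptions; the higher calls are the standard innerloop_ctx / diffopt.step pattern). After backpropagating the outer loss through the unrolled inner loop, the gradients land directly on the base model's W:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import higher

    model = nn.Linear(4, 1)                                   # base model holding W
    inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)
    outer_opt = torch.optim.Adam(model.parameters(), lr=1e-2)

    x_spt, y_spt = torch.randn(8, 4), torch.randn(8, 1)       # toy support set
    x_qry, y_qry = torch.randn(8, 4), torch.randn(8, 1)       # toy query set

    outer_opt.zero_grad()
    with higher.innerloop_ctx(model, inner_opt,
                              copy_initial_weights=False) as (fmodel, diffopt):
        for _ in range(3):                                     # unrolled inner loop
            diffopt.step(F.mse_loss(fmodel(x_spt), y_spt))
        F.mse_loss(fmodel(x_qry), y_qry).backward()            # gradient path reaches W

    print(model.weight.grad is not None)   # True: W gets a gradient, so the
                                           # outer optimizer can train the init
    outer_opt.step()                       # in-place update of W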

The p.clone() line above might look confusing if you're not familiar with clone, but what it's doing is making a copy of the current weights of W. The unusual thing is that clone also remembers the gradient history of the tensor it came from (.clone() acts as an identity for autograd). Its main use is to add an extra layer of safety against the user doing dangerous in-place ops in their differentiable optimizer. Assuming the user never did anything crazy with in-place ops, one could in theory remove the .clone(). The reason this is confusing, imho, is that "copying" in Pytorch (cloning) does not automatically block gradient flows, which is what a "real" copy would do (i.e. create a 100% totally separate tensor). That is not what clone does, and that is not what copy_initial_weights does.
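
A tiny plain-Pytorch sketch of that point (toy tensors of my own choosing): the clone keeps the autograd link back to the original, while a "real" copy (clone + detach) does not.

    import torch

    w = torch.tensor([2.0], requires_grad=True)

    w_clone = w.clone()               # copies the data but keeps the autograd history
    (3 * w_clone).sum().backward()
    print(w.grad)                     # tensor([3.]): the gradient flowed through the clone

    w_copy = w.clone().detach()       # a "real", fully separate copy
    print(w_copy.requires_grad)       # False: no gradient path back to w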

When copy_initial_weights=True, what really happens is that the weights are cloned and detached. See the code it eventually runs (here: https://github.com/facebookresearch/higher/blob/e45c1a059e39a16fa016d37bc15397824c65547c/higher/utils.py#L30):

params = [_copy_tensor(p, safe_copy, device) for p in module.parameters()]

which runs _copy_tensor (assuming they are doing a safe copy, i.e. doing the extra clone):

 t = t.clone().detach().requires_grad_(t.requires_grad)

Note that .detach() does not allocate new memory. It shares memory with the original tensor, which is why the .clone() is needed to make this op "safe" (usually with respect to in-place ops).
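
A minimal sketch of the memory point (plain Pytorch, toy tensor of my own choosing):

    import torch

    t = torch.randn(3, requires_grad=True)

    d = t.detach()                          # no new memory: shares storage with t
    print(d.data_ptr() == t.data_ptr())     # True

    c = t.clone().detach()                  # the "safe" copy: fresh storage, no grad path
    print(c.data_ptr() == t.data_ptr())     # False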

So when copy_initial_weights=True they are copying and detaching the current value of W, which is usually W^<0,outer_i> if we are doing the usual meta-learning with an inner adaptation loop. That is the intended semantics of copy_initial_weights, and by the "initial weights" they simply mean W. The important thing to note is that the intermediate tensors for the net in the inner loop are not denoted in my notation; they are fmodel.parameters(t=inner_i). Also, in the usual meta-learning setup we have fmodel.parameters(t=0) = W, and it gets updated in-place by the outer optimizer.
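
For contrast with the copy_initial_weights=False sketch above, here is the same hedged toy setup with copy_initial_weights=True: the unrolled graph now starts from a detached copy of W, so the outer backward never reaches the base model's parameters.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import higher

    model = nn.Linear(4, 1)
    inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x_spt, y_spt = torch.randn(8, 4), torch.randn(8, 1)       # toy support set
    x_qry, y_qry = torch.randn(8, 4), torch.randn(8, 1)       # toy query set

    with higher.innerloop_ctx(model, inner_opt,
                              copy_initial_weights=True) as (fmodel, diffopt):
        for _ in range(3):
            diffopt.step(F.mse_loss(fmodel(x_spt), y_spt))
        F.mse_loss(fmodel(x_qry), y_qry).backward()

    print(model.weight.grad)   # None: no gradient path from the outer loss to W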

Note that because of the outer optimizer's in-place op and the freeing of the graphs, we never take the derivative with respect to the very first initial value W^<0,0> of W, which was something I initially thought we were doing.
