What does the copy_initial_weights documentation mean in the higher library for PyTorch?
Problem description
I was trying to use the higher library for meta-learning and had trouble understanding what copy_initial_weights means. The docs say:
copy_initial_weights – if true, the weights of the patched module are copied to form the initial weights of the patched module, and thus are not part of the gradient tape when unrolling the patched module. If this is set to False, the actual module weights will be the initial weights of the patched module. This is useful when doing MAML, for example.
But that doesn't make much sense to me, for the following reasons:
For example, "the weights of the patched module are copied to form the initial weights of the patched module" doesn't make sense to me because when the context manager is initiated a patched module does not exist yet. So it is unclear what we are copying from and to where (and why copying is something we want to do).
Also, "unrolling the patched module" does not make sense to me. We usually unroll a computation graph caused by a for loop. A patched module is just a neural net that has been modified by this library. Unrolling is ambiguous.
Also, there isn't a technical definition for "gradient tape".
Also, when describing what false is, saying that it's useful for MAML isn't actually useful because it doesn't even hint why it's useful for MAML.
Overall, it's impossible to use the context manager.
Any explanations and examples of what that flag does in more precise terms would be really valuable.
Related:
- gitissue: https://github.com/facebookresearch/higher/issues/30
- new gitissue: https://github.com/facebookresearch/higher/issues/54
- pytorch forum: https://discuss.pytorch.org/t/why-does-maml-need-copy-initial-weights-false/70387
- pytorch forum: https://discuss.pytorch.org/t/what-does-copy-initial-weights-do-in-the-higher-library/70384
- important question related to this on how the fmodel parameters are copied so that the optimizers work (and the use of deep copy): Why does higher need to deep copy the parameters of the base model to create a functional model?
Recommended answer
Short version
A call to higher.innerloop_ctx with model as an argument creates a temporary patched model and an unrolled optimizer for that model: (fmodel, diffopt). It is expected that in the inner loop fmodel will iteratively receive some input, compute output and loss, and then diffopt.step(loss) will be called. Each time diffopt.step is called, fmodel creates the next version of its parameters, fmodel.parameters(time=T), which are new tensors computed from the previous ones (with the full graph retained, so gradients can be computed through the process). If at any point the user calls backward on any tensor, regular PyTorch gradient computation/accumulation starts in a way that allows gradients to propagate to, e.g., the optimizer's parameters (such as lr and momentum, if they were passed to higher.innerloop_ctx as tensors requiring gradients using override).
The creation-time version of fmodel's parameters, fmodel.parameters(time=0), is a copy of the original model's parameters. If copy_initial_weights=True is provided (the default), then fmodel.parameters(time=0) will be a clone+detach'ed version of model's parameters (i.e. it will preserve the values but sever all connections to the original model). If copy_initial_weights=False is provided, then fmodel.parameters(time=0) will be a clone'd version of model's parameters and thus will allow gradients to propagate to the original model's parameters (see the PyTorch doc on clone).
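The clone versus clone+detach distinction can be seen with plain PyTorch tensors, without higher at all. The following is a minimal sketch under that assumption: `w` is a hypothetical stand-in for one parameter of the original model, and the two copies only mimic the behaviour described above rather than reproduce the library's internal code.

```python
import torch

# Hypothetical stand-in for a single parameter of the original model.
w = torch.tensor([2.0], requires_grad=True)

# Analogue of copy_initial_weights=True: same value, but cut off from w.
w_detached = w.clone().detach().requires_grad_()
(w_detached * 3).sum().backward()
grad_after_detach = w.grad  # nothing reached w

# Analogue of copy_initial_weights=False: clone() stays connected to w in the graph.
w_cloned = w.clone()
(w_cloned * 3).sum().backward()
grad_after_clone = w.grad.clone()  # gradient flowed back through the clone

print(grad_after_detach, grad_after_clone)
```

In the first case the gradient lands on the detached copy only; in the second it is accumulated on `w` itself, which is what an outer optimizer over the original parameters would need.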
Terminology clarification
- gradient tape here refers to the graph PyTorch uses to go through computations and propagate gradients to all leaf tensors requiring gradients. If at some point you cut the link to some leaf tensor requiring gradients (e.g. how it is done for fnet.parameters() in the copy_initial_weights=True case), then the original model.parameters() won't be "on the gradient tape" anymore for your meta_loss.backward() computation.
- unrolling the patched module here refers to the part of the meta_loss.backward() computation in which PyTorch goes through all fnet.parameters(time=T), starting from the latest and ending with the earliest (higher doesn't control this process; it is just regular PyTorch gradient computation. higher is only in charge of how these new time=T parameters are created from the previous ones each time diffopt.step is called, and of making fnet always use the latest ones for forward computation).
Long version
Let's start from the beginning. The main functionality (the only functionality, really) of the higher library is unrolling a model's parameter optimization in a differentiable manner. It can come either in the form of directly using a differentiable optimizer through e.g. higher.get_diff_optim, as in this example, or in the form of higher.innerloop_ctx, as in this example.
The higher.innerloop_ctx option wraps the creation of a "stateless" model fmodel from the existing model for you and gives you an "optimizer" diffopt for this fmodel. So, as summarized in the README.md of higher, it allows you to switch from:
model = MyModel()
opt = torch.optim.Adam(model.parameters())

for xs, ys in data:
    opt.zero_grad()
    logits = model(xs)
    loss = loss_function(logits, ys)
    loss.backward()
    opt.step()
to
model = MyModel()
opt = torch.optim.Adam(model.parameters())

with higher.innerloop_ctx(model, opt) as (fmodel, diffopt):
    for xs, ys in data:
        logits = fmodel(xs)  # modified `params` can also be passed as a kwarg
        loss = loss_function(logits, ys)  # no need to call loss.backward()
        diffopt.step(loss)  # note that `step` must take `loss` as an argument!

    # At the end of your inner loop you can obtain these e.g. ...
    grad_of_grads = torch.autograd.grad(
        meta_loss_fn(fmodel.parameters()), fmodel.parameters(time=0))
The difference between training model and doing diffopt.step to update fmodel is that fmodel does not update its parameters in-place, as opt.step() in the original version would. Instead, each time diffopt.step is called, new versions of the parameters are created in such a way that fmodel uses the new ones for the next step, while all previous ones are still preserved.
That is, fmodel starts with only fmodel.parameters(time=0) available, but after you have called diffopt.step N times you can ask fmodel to give you fmodel.parameters(time=i) for any i up to and including N. Notice that fmodel.parameters(time=0) doesn't change in this process at all; every time fmodel is applied to some input, it simply uses the latest version of the parameters it currently has.
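To make the idea of time-indexed parameter versions concrete, here is a hand-rolled sketch in plain PyTorch that does not use higher. It assumes a single-weight model and plain SGD without momentum; the list `versions` is a hypothetical stand-in for what fmodel.parameters(time=t) would return, and the update rule is written out by hand rather than taken from the library.

```python
import torch

w0 = torch.tensor([1.0], requires_grad=True)  # initial weight, plays the role of time=0
x = torch.tensor([1.0, 2.0, 3.0])
y = 2.0 * x                                   # targets generated with true multiplier 2
lr = 0.05

versions = [w0]  # versions[t] stands in for fmodel.parameters(time=t)
for _ in range(3):
    w = versions[-1]                          # forward pass always uses the latest version
    loss = ((w * x - y) ** 2).mean()
    # create_graph=True keeps the update itself differentiable
    (g,) = torch.autograd.grad(loss, w, create_graph=True)
    versions.append(w - lr * g)               # a new tensor; nothing is modified in-place

meta_loss = ((versions[-1] - 2.0) ** 2).sum()
meta_loss.backward()                          # walks back through every version to w0
print(len(versions), w0.item(), w0.grad.item())
```

After the loop the earliest version is still the untouched `w0`, and the backward call on the meta-loss reaches it through all intermediate versions.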
Now, what exactly is fmodel.parameters(time=0)? It is created here and depends on copy_initial_weights. If copy_initial_weights==True, then fmodel.parameters(time=0) are clone'd and detach'ed parameters of model. Otherwise they are only clone'd, but not detach'ed!
That means that when we do a meta-optimization step, the original model's parameters will actually accumulate gradients if and only if copy_initial_weights==False. And in MAML we want to optimize model's starting weights, so we actually do need to get gradients from the meta-optimization step.
I think one of the issues here is that higher lacks simple toy examples that demonstrate what is going on, and instead rushes to show more serious things as examples. So let me try to fill that gap here and demonstrate what is going on using the simplest toy example I could come up with (a model with 1 weight which multiplies the input by that weight):
import torch
import torch.nn as nn
import torch.optim as optim
import higher
import numpy as np
np.random.seed(1)
torch.manual_seed(3)
N = 100
actual_multiplier = 3.5
meta_lr = 0.00001
loops = 5 # how many iterations in the inner loop we want to do
x = torch.tensor(np.random.random((N,1)), dtype=torch.float64) # features for inner training loop
y = x * actual_multiplier # target for inner training loop
model = nn.Linear(1, 1, bias=False).double() # simplest possible model - multiply input x by weight w without bias
meta_opt = optim.SGD(model.parameters(), lr=meta_lr, momentum=0.)

def run_inner_loop_once(model, verbose, copy_initial_weights):
    lr_tensor = torch.tensor([0.3], requires_grad=True)
    momentum_tensor = torch.tensor([0.5], requires_grad=True)
    opt = optim.SGD(model.parameters(), lr=0.3, momentum=0.5)
    with higher.innerloop_ctx(model, opt, copy_initial_weights=copy_initial_weights, override={'lr': lr_tensor, 'momentum': momentum_tensor}) as (fmodel, diffopt):
        for j in range(loops):
            if verbose:
                print('Starting inner loop step j=={0}'.format(j))
                print(' Representation of fmodel.parameters(time={0}): {1}'.format(j, str(list(fmodel.parameters(time=j)))))
                print(' Notice that fmodel.parameters() is same as fmodel.parameters(time={0}): {1}'.format(j, (list(fmodel.parameters())[0] is list(fmodel.parameters(time=j))[0])))
            out = fmodel(x)
            if verbose:
                print(' Notice how `out` is `x` multiplied by the latest version of weight: {0:.4} * {1:.4} == {2:.4}'.format(x[0,0].item(), list(fmodel.parameters())[0].item(), out[0].item()))
            loss = ((out - y)**2).mean()
            diffopt.step(loss)

        if verbose:
            # after all inner training let's see all steps' parameter tensors
            print()
            print("Let's print all intermediate parameters versions after inner loop is done:")
            for j in range(loops+1):
                print(' For j=={0} parameter is: {1}'.format(j, str(list(fmodel.parameters(time=j)))))
            print()

        # let's imagine now that our meta-learning optimization is trying to check how far we got in the end from the actual_multiplier
        weight_learned_after_full_inner_loop = list(fmodel.parameters())[0]
        meta_loss = (weight_learned_after_full_inner_loop - actual_multiplier)**2
        print(' Final meta-loss: {0}'.format(meta_loss.item()))
        meta_loss.backward() # will only propagate gradient to original model parameter's `grad` if copy_initial_weights=False
        if verbose:
            print(' Gradient of final loss we got for lr and momentum: {0} and {1}'.format(lr_tensor.grad, momentum_tensor.grad))
            print(' If you change number of iterations "loops" to much larger number final loss will be stable and the values above will be smaller')
        return meta_loss.item()

print('=================== Run Inner Loop First Time (copy_initial_weights=True) =================\n')
meta_loss_val1 = run_inner_loop_once(model, verbose=True, copy_initial_weights=True)
print("\nLet's see if we got any gradient for initial model parameters: {0}\n".format(list(model.parameters())[0].grad))
print('=================== Run Inner Loop Second Time (copy_initial_weights=False) =================\n')
meta_loss_val2 = run_inner_loop_once(model, verbose=False, copy_initial_weights=False)
print("\nLet's see if we got any gradient for initial model parameters: {0}\n".format(list(model.parameters())[0].grad))
print('=================== Run Inner Loop Third Time (copy_initial_weights=False) =================\n')
final_meta_gradient = list(model.parameters())[0].grad.item()
# Now let's double-check `higher` library is actually doing what it promised to do, not just giving us
# a bunch of hand-wavy statements and difficult to read code.
# We will do a simple SGD step using meta_opt changing initial weight for the training and see how meta loss changed
meta_opt.step()
meta_opt.zero_grad()
meta_step = - meta_lr * final_meta_gradient # how much meta_opt actually shifted initial weight value
meta_loss_val3 = run_inner_loop_once(model, verbose=False, copy_initial_weights=False)
meta_loss_gradient_approximation = (meta_loss_val3 - meta_loss_val2) / meta_step
print()
print('Side-by-side meta_loss_gradient_approximation and gradient computed by `higher` lib: {0:.4} VS {1:.4}'.format(meta_loss_gradient_approximation, final_meta_gradient))
This produces the following output:
=================== Run Inner Loop First Time (copy_initial_weights=True) =================
Starting inner loop step j==0
Representation of fmodel.parameters(time=0): [tensor([[-0.9915]], dtype=torch.float64, requires_grad=True)]
Notice that fmodel.parameters() is same as fmodel.parameters(time=0): True
Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * -0.9915 == -0.4135
Starting inner loop step j==1
Representation of fmodel.parameters(time=1): [tensor([[-0.1217]], dtype=torch.float64, grad_fn=<AddBackward0>)]
Notice that fmodel.parameters() is same as fmodel.parameters(time=1): True
Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * -0.1217 == -0.05075
Starting inner loop step j==2
Representation of fmodel.parameters(time=2): [tensor([[1.0145]], dtype=torch.float64, grad_fn=<AddBackward0>)]
Notice that fmodel.parameters() is same as fmodel.parameters(time=2): True
Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * 1.015 == 0.4231
Starting inner loop step j==3
Representation of fmodel.parameters(time=3): [tensor([[2.0640]], dtype=torch.float64, grad_fn=<AddBackward0>)]
Notice that fmodel.parameters() is same as fmodel.parameters(time=3): True
Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * 2.064 == 0.8607
Starting inner loop step j==4
Representation of fmodel.parameters(time=4): [tensor([[2.8668]], dtype=torch.float64, grad_fn=<AddBackward0>)]
Notice that fmodel.parameters() is same as fmodel.parameters(time=4): True
Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * 2.867 == 1.196
Let's print all intermediate parameters versions after inner loop is done:
For j==0 parameter is: [tensor([[-0.9915]], dtype=torch.float64, requires_grad=True)]
For j==1 parameter is: [tensor([[-0.1217]], dtype=torch.float64, grad_fn=<AddBackward0>)]
For j==2 parameter is: [tensor([[1.0145]], dtype=torch.float64, grad_fn=<AddBackward0>)]
For j==3 parameter is: [tensor([[2.0640]], dtype=torch.float64, grad_fn=<AddBackward0>)]
For j==4 parameter is: [tensor([[2.8668]], dtype=torch.float64, grad_fn=<AddBackward0>)]
For j==5 parameter is: [tensor([[3.3908]], dtype=torch.float64, grad_fn=<AddBackward0>)]
Final meta-loss: 0.011927987982895929
Gradient of final loss we got for lr and momentum: tensor([-1.6295]) and tensor([-0.9496])
If you change number of iterations "loops" to much larger number final loss will be stable and the values above will be smaller
Let's see if we got any gradient for initial model parameters: None
=================== Run Inner Loop Second Time (copy_initial_weights=False) =================
Final meta-loss: 0.011927987982895929
Let's see if we got any gradient for initial model parameters: tensor([[-0.0053]], dtype=torch.float64)
=================== Run Inner Loop Third Time (copy_initial_weights=False) =================
Final meta-loss: 0.01192798770078706
Side-by-side meta_loss_gradient_approximation and gradient computed by `higher` lib: -0.005311 VS -0.005311
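The final side-by-side line is a finite-difference sanity check: nudge the initial weight, rerun the inner loop, and compare the change in meta-loss with the gradient obtained by differentiating through the unrolled updates. The same idea can be sketched in pure Python with no libraries. This is a simplified illustration under stated assumptions, not the script above: it uses plain SGD without momentum, a central difference instead of an actual meta_opt step, hypothetical helper names (`inner_loop`, `meta_loss_and_grad`), and made-up data.

```python
def inner_loop(w0, xs, ys, lr, steps):
    """Run plain SGD on the mean squared error of y = w * x.

    Returns the final weight and d(final weight)/d(w0), tracked by hand.
    """
    mean_x2 = sum(x * x for x in xs) / len(xs)
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / len(xs)
    w, dw_dw0 = w0, 1.0
    for _ in range(steps):
        grad = 2.0 * (mean_x2 * w - mean_xy)   # d(loss)/dw for the current weight
        w = w - lr * grad                      # one unrolled SGD update
        dw_dw0 *= 1.0 - 2.0 * lr * mean_x2     # chain rule through that update
    return w, dw_dw0


def meta_loss_and_grad(w0, xs, ys, target, lr=0.3, steps=5):
    """Meta-loss (final weight minus target, squared) and its gradient w.r.t. w0."""
    w, dw_dw0 = inner_loop(w0, xs, ys, lr, steps)
    return (w - target) ** 2, 2.0 * (w - target) * dw_dw0


xs = [0.1, 0.5, 0.9]                 # made-up inputs
ys = [3.5 * x for x in xs]           # targets with true multiplier 3.5
w0, eps = -1.0, 1e-6

_, analytic = meta_loss_and_grad(w0, xs, ys, target=3.5)
loss_plus, _ = meta_loss_and_grad(w0 + eps, xs, ys, target=3.5)
loss_minus, _ = meta_loss_and_grad(w0 - eps, xs, ys, target=3.5)
numeric = (loss_plus - loss_minus) / (2 * eps)

print('analytic: {0:.6f}  finite difference: {1:.6f}'.format(analytic, numeric))
```

If the two numbers agree, the gradient with respect to the initial weight really does account for every inner update, which is the property copy_initial_weights=False exposes to the outer optimizer.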