Using stop_gradient with AdamOptimizer in TensorFlow


Problem Description


I am trying to implement a training/finetuning framework in which, in each backpropagation iteration, a certain set of parameters stays fixed. I want to be able to change the set of updating or fixed parameters from iteration to iteration. The TensorFlow method tf.stop_gradient, which apparently forces the gradients of some parameters to stay zero, is very useful for this purpose, and it works perfectly fine with different optimizers if the set of updating or fixed parameters does not change from iteration to iteration. It can also handle a varying set of updating or fixed parameters if it is used with stochastic gradient descent. My problem is that tf.stop_gradient cannot handle such cases when used with the Adam optimizer. More specifically, it does keep the gradients of the fixed parameters at zero in the output of tf.compute_gradients, but when the gradients are applied (tf.apply_gradients), the values of the fixed parameters do change. I suppose this is because the optimization step in the Adam optimizer is not zero even if the gradient is zero (based on Algorithm 1 in Kingma and Ba's paper). Is there a cheap way of freezing a variable set of parameters in each Adam iteration, without explicitly saving the previous iteration's values of the fixed parameters?


Suppose I have a single-layer network with a weight matrix variable W and a binary mask matrix placeholder MW that specifies which elements of W should get updated in each iteration (value 1 in the mask). Instead of using W directly to write the input/output relationship of this layer, I modify it as below

masked_W = MW*W + tf.stop_gradient(tf.abs(1-MW)*W)


to mask certain elements of W from having non-zero gradients. Then I use masked_W to form the output of the layer, and consequently the loss of the network depends on this masked variable. The point is that MW changes in each iteration. Suppose W is a vector of 4 elements initialized to the all-zero vector. Here is what happens:
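(For context: the snippet below refers to a loss, placeholders x and y_, and a session sess that the question does not define. The following is a minimal sketch of one possible setup; the shapes of x and y_, the single-output layer, and the squared-error loss are assumptions for illustration only, not from the original question.)

import tensorflow as tf

x  = tf.placeholder(tf.float32, [None, 4])    # assumed input batch
y_ = tf.placeholder(tf.float32, [None, 1])    # assumed labels
MW = tf.placeholder(tf.float32, [4])          # binary mask: 1 = update, 0 = freeze
W  = tf.Variable(tf.zeros([4]))               # 4-element weight vector from the question

# gradients flow only through the masked (updating) part of W
masked_W = MW*W + tf.stop_gradient(tf.abs(1-MW)*W)

y    = tf.reduce_sum(x*masked_W, axis=1, keepdims=True)  # assumed single-output layer
loss = tf.reduce_mean(tf.square(y - y_))                 # assumed squared-error loss

sess = tf.Session()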

opt = tf.train.AdamOptimizer(1e-5)
grads_vars = opt.compute_gradients(loss, [W])
train_op = opt.apply_gradients(grads_vars)   # building the update op also creates Adam's slot variables
sess.run(tf.global_variables_initializer())  # initialize W and the Adam slots

# initial value of W=[0,0,0,0]

# first iteration:
MW_val = [0,1,1,0]
feed_dict={MW:MW_val, x: batch_of_data, y_:batch_of_labels}
sess.run(train_op, feed_dict=feed_dict)
# gradient of  W=[0,xx,xx,0]
# new value of W=[0,a,b,0]


where xx denotes some non-zero gradient value, and a and b are the new values of the updated elements of W. In the second iteration, we change the value assigned to the binary mask matrix MW to [1,0,0,1], hence we expect W[1] and W[2] to stay fixed and W[0] and W[3] to be updated. But this is what happens:

# second iteration
MW_val = [1,0,0,1]
feed_dict={MW:MW_val, x: batch_of_data, y_:batch_of_labels}
sess.run(train_op, feed_dict=feed_dict)
# gradient of  W=[xx,0,0,xx]
# new value of W=[c,aa,bb,d]


That is, although the gradients of W[1] and W[2] are zero, they get new values (aa != a and bb != b). When the optimizer is changed from Adam to SGD, the values of the fixed parameters stay the same, as expected.
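To see why Adam behaves this way, here is a minimal NumPy sketch of a single update from Algorithm 1 of Kingma and Ba, applied to one scalar parameter whose moment estimates are already non-zero from an earlier step. The starting values of theta, m, v and the timestep t are made up for illustration.

import numpy as np

# Adam hyperparameters: defaults from the paper, learning rate from the question
alpha, beta1, beta2, eps = 1e-5, 0.9, 0.999, 1e-8

# state left over from an earlier update: illustrative, non-zero moment estimates
theta, m, v, t = 0.3, 0.01, 1e-4, 2

g = 0.0                                  # gradient forced to zero by tf.stop_gradient
m = beta1 * m + (1 - beta1) * g          # first moment estimate stays non-zero
v = beta2 * v + (1 - beta2) * g**2       # second moment estimate stays non-zero
m_hat = m / (1 - beta1**t)               # bias-corrected moment estimates
v_hat = v / (1 - beta2**t)
step = alpha * m_hat / (np.sqrt(v_hat) + eps)
theta -= step
print(step)                              # non-zero: theta moves even though g == 0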

Recommended Answer


I found a solution to my question and am sharing it here in case others find it useful. After the first iteration, the moment estimates of the parameters that were updated in that iteration are already non-zero. Therefore, even if their gradients are set to zero in the second iteration, they will still be updated because of their non-zero moment tensors. To prevent the updates, using tf.stop_gradient alone is not enough; we also have to remove their moments. For the Adam optimizer, this can be done through the optimizer's get_slot method: opt.get_slot(par, 'm') and opt.get_slot(par, 'v'), where the former and the latter give access to the first and second moment tensors of parameter par, respectively. In the example of the question, we have to add the following lines to freeze W[1] and W[2] in the second iteration:

# moment (slot) values of W after the first iteration
m_vals = sess.run(opt.get_slot(W, 'm'))
v_vals = sess.run(opt.get_slot(W, 'v'))
# zero the moments of W[1] and W[2] before running the second iteration
masked_m_vals = m_vals.copy()
masked_v_vals = v_vals.copy()
masked_m_vals[[1,2]] = 0
masked_v_vals[[1,2]] = 0
sess.run(opt.get_slot(W, 'm').assign(masked_m_vals))
sess.run(opt.get_slot(W, 'v').assign(masked_v_vals))


It is better to save the masked moment values, in the example above m_vals[[1,2]] and v_vals[[1,2]], so that if the freezing constraint on W[1] and W[2] is relaxed in the third iteration, we can restore their moments to the values they had after the first iteration.
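For example, one way to do this bookkeeping, continuing the session-based code above (the names saved_m, saved_v, restored_m and restored_v are introduced here for illustration):

# keep the original moment values of W[1] and W[2] before zeroing them
saved_m = m_vals[[1,2]].copy()
saved_v = v_vals[[1,2]].copy()

# ... run the second iteration with W[1] and W[2] frozen ...

# third iteration: W[1] and W[2] become trainable again, so put their moments back
restored_m = sess.run(opt.get_slot(W, 'm'))
restored_v = sess.run(opt.get_slot(W, 'v'))
restored_m[[1,2]] = saved_m
restored_v[[1,2]] = saved_v
sess.run(opt.get_slot(W, 'm').assign(restored_m))
sess.run(opt.get_slot(W, 'v').assign(restored_v))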

