DDPG (Deep Deterministic Policy Gradients), how is the actor updated?


Problem description

I'm currently trying to implement DDPG in Keras. I know how to update the critic network (the normal DQN algorithm), but I'm currently stuck on updating the actor network, which uses the equation:

dJ/dtheta = E[ dQ/da * da/dtheta ],   with a = mu(s), the actor's output

So, in order to reduce the loss of the actor network with respect to its weights, dJ/dtheta, it uses the chain rule to get dQ/da (from the critic network) * da/dtheta (from the actor network).

This looks fine, but I'm having trouble understanding how to derive the gradients from those two networks. Could someone perhaps explain this part to me?

Recommended answer

So the main intuition here is that J is something you want to maximize instead of minimize. Therefore, we can call it an objective function instead of a loss function. The equation simplifies down to:

dJ/dTheta = dQ/da * da/dTheta = dQ/dTheta
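For comparison, the batch-averaged form of this gradient, as written in the DDPG paper (Lillicrap et al., 2015), is:

```latex
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i}
  \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_i,\, a = \mu(s_i)} \;
  \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s = s_i}
```

Here theta^mu are the actor weights, theta^Q the critic weights, and N is the batch size, which is why the recipe below ends by dividing the gradient by the batch size.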

Meaning you want to change the parameters Theta in order to change Q. Since in RL we want to maximize Q, for this part we want to do gradient ascent instead. To do this, you just perform gradient descent, except you feed in the gradients as negative values.
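As a toy illustration of that trick (not DDPG itself; the objective, optimizer and learning rate below are made up for this example), a descent optimizer will maximize a function if you hand it the negated gradient:

```python
import tensorflow as tf

# Toy example: maximize f(x) = -(x - 3)^2, whose maximizer is x = 3,
# by feeding the NEGATED gradient to a plain gradient-descent optimizer.
x = tf.Variable(0.0)
opt = tf.keras.optimizers.SGD(learning_rate=0.1)

for _ in range(100):
    with tf.GradientTape() as tape:
        f = -(x - 3.0) ** 2            # objective we want to MAXIMIZE
    grad = tape.gradient(f, x)         # df/dx
    opt.apply_gradients([(-grad, x)])  # descending on -f == ascending on f

print(float(x))  # ~= 3.0
```

The DDPG actor update in the steps below does exactly this, just with dQ/dA * dA/dTheta as the gradient.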

To derive the gradients, do the following:

  1. Using the online actor network, send in a batch of states that was sampled from your replay memory. (The same batch used to train the critic.)
  2. Calculate the deterministic action for each of those states.
  3. Send those states, together with the actions from step 2, to the online critic network to map those exact state-action pairs to Q values.
  4. Calculate the gradient of the Q values with respect to the actions calculated in step 2. We can use tf.gradients(Q_values, actions) to do this. Now we have dQ/dA.
  5. Send the states through the online actor network again and map them to actions.
  6. Calculate the gradient of the actions with respect to the online actor network weights, again using tf.gradients(a, network_weights). This will give you dA/dTheta.
  7. Multiply dQ/dA by -dA/dTheta to get GRADIENT ASCENT. We are left with the gradient of the objective function, i.e., gradient J.
  8. Divide all elements of gradient J by the batch size, i.e.,

     for j in J,
         j / batch size

  • Apply a variant of gradient descent by first zipping gradient J with the network parameters. This can be done with an optimizer's apply_gradients, e.g. tf.train.AdamOptimizer(learning_rate).apply_gradients(zip(J, network_params)). (A runnable sketch of the whole recipe follows below.)
  • And bam, your actor is training its parameters with respect to maximizing Q.
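Here is a minimal, runnable sketch of the whole recipe above (steps 1-8 plus the apply_gradients step) in the TF1-style graph API the answer refers to, via tf.compat.v1. The layer sizes, scope names, learning rate and the random batch are made up for illustration; this is a sketch of the technique, not the answerer's actual code:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

STATE_DIM, ACTION_DIM, BATCH_SIZE = 3, 1, 64  # toy sizes, made up


def dense(x, units, name, activation=None):
    """A plain fully connected layer built from tf.get_variable."""
    with tf.variable_scope(name):
        w = tf.get_variable("w", [int(x.shape[-1]), units],
                            initializer=tf.glorot_uniform_initializer())
        b = tf.get_variable("b", [units], initializer=tf.zeros_initializer())
        y = tf.matmul(x, w) + b
        return activation(y) if activation else y


states = tf.placeholder(tf.float32, [None, STATE_DIM], name="states")

# Steps 1-2 (and 5): the online actor maps the sampled states to deterministic actions.
with tf.variable_scope("actor"):
    actions = dense(dense(states, 32, "h1", tf.nn.relu), ACTION_DIM, "out", tf.nn.tanh)
actor_params = tf.trainable_variables("actor")

# Step 3: the online critic maps those exact (state, action) pairs to Q values.
with tf.variable_scope("critic"):
    q_values = dense(dense(tf.concat([states, actions], 1), 32, "h1", tf.nn.relu), 1, "out")

# Step 4: dQ/dA.
dq_da = tf.gradients(q_values, actions)[0]

# Steps 6-7: dA/dTheta weighted by -dQ/dA (via grad_ys) gives -dQ/dTheta,
# i.e. gradient ASCENT on Q once it is handed to a descent optimizer.
actor_grads = tf.gradients(actions, actor_params, grad_ys=-dq_da)

# Step 8: divide every element of the gradient by the batch size.
actor_grads = [g / BATCH_SIZE for g in actor_grads]

# Final step: zip with the network parameters and apply.
train_actor = tf.train.AdamOptimizer(1e-4).apply_gradients(zip(actor_grads, actor_params))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.random.randn(BATCH_SIZE, STATE_DIM).astype(np.float32)  # stand-in for replay samples
    sess.run(train_actor, feed_dict={states: batch})
```

In TF2/Keras the same update is usually written with a GradientTape on -mean(Q), which lets automatic differentiation handle the dQ/dA * dA/dTheta product in one step.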
I hope this makes sense! I also had a hard time understanding this concept, and am still a little fuzzy on some parts to be completely honest. Let me know if I can clarify anything!
