What is the way to understand Proximal Policy Optimization Algorithm in RL?

Problem Description

I know the basics of Reinforcement Learning, but what terms is it necessary to understand to be able to read the arXiv PPO paper?

What is the roadmap to learn and use PPO?

Recommended Answer

To better understand PPO, it is helpful to look at the main contributions of the paper, which are: (1) the Clipped Surrogate Objective and (2) the use of "multiple epochs of stochastic gradient ascent to perform each policy update".


From the original PPO paper:

We have introduced [PPO], a family of policy optimization methods that use multiple epochs of stochastic gradient ascent to perform each policy update. These methods have the stability and reliability of trust-region [TRPO] methods but are much simpler to implement, requiring only a few lines of code change to a vanilla policy gradient implementation, applicable in more general settings (for example, when using a joint architecture for the policy and value function), and have better overall performance.


1. The Clipped Surrogate Objective

The Clipped Surrogate Objective is a drop-in replacement for the policy gradient objective that is designed to improve training stability by limiting the change you make to your policy at each step.

For vanilla policy gradients (e.g., REINFORCE) --- which you should be familiar with, or familiarize yourself with before you read this --- the objective used to optimize the neural network looks like:
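In the paper's notation (Ê is an empirical average over sampled timesteps, Â is the advantage estimate discussed just below):

L^PG(θ) = Ê[ log π_θ(a | s) · Â ]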

This is the standard formula that you would see in the Sutton book, and other resources, where the A-hat could be the discounted return (as in REINFORCE) or the advantage function (as in GAE) for example. By taking a gradient ascent step on this loss with respect to the network parameters, you will incentivize the actions that led to higher reward.
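(For reference, the GAE estimator mentioned there is Â_t = Σ_{l≥0} (γλ)^l · δ_{t+l}, where δ_t = r_t + γ·V(s_{t+1}) − V(s_t) is the TD residual of a learned value function; that detail is not needed to follow the rest of this answer.)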

The vanilla policy gradient method uses the log probability of your action (log π(a | s)) to trace the impact of the actions, but you could imagine using another function to do this. Another such function, introduced in this paper, uses the probability of the action under the current policy (π(a|s)), divided by the probability of the action under your previous policy (π_old(a|s)). This looks a bit similar to importance sampling if you are familiar with that:
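Written out, that ratio is:

r(θ) = π(a | s) / π_old(a | s)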

This r(θ) will be greater than 1 when the action is more probable for your current policy than it is for your old policy; it will be between 0 and 1 when the action is less probable for your current policy than for your old.

Now to build an objective function with this r(θ), we can simply swap it in for the log π(a|s) term. This is what is done in TRPO:
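That is, the surrogate becomes (what the PPO paper calls the "conservative policy iteration" objective, L^CPI):

L^CPI(θ) = Ê[ r(θ) · Â ]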

But what would happen here if your action is much more probable (like 100x more) for your current policy? r(θ) will tend to be really big and lead to taking big gradient steps that might wreck your policy. To deal with this and other issues, TRPO adds several extra bells and whistles (e.g., KL Divergence constraints) to limit the amount the policy can change and help guarantee that it is monotonically improving.

Instead of adding all these extra bells and whistles, what if we could build these properties into the objective function? As it turns out, this is what PPO does. It gains the same performance benefits and avoids the complications by optimizing this simple (but kind of funny looking) Clipped Surrogate Objective:
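In the paper's notation (with e the clipping hyperparameter):

L^CLIP(θ) = Ê[ min( r(θ)·Â, clip(r(θ), 1 − e, 1 + e)·Â ) ]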

The first term (blue) inside the minimization is the same (r(θ)A) term we saw in the TRPO objective. The second term (red) is a version where the (r(θ)) is clipped between (1 - e, 1 + e). (in the paper they state a good value for e is about 0.2, so r can vary between ~(0.8, 1.2)). Then, finally, the minimization of both of these terms is taken (green).

Take your time and look at the equation carefully and make sure you know what all the symbols mean, and mathematically what is happening. Looking at the code may also help; here is the relevant section in both the OpenAI baselines and anyrl-py implementations.
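If code is easier to parse than symbols, here is a minimal NumPy sketch of the objective. This is my own illustration rather than the baselines or anyrl-py code, and the array names and the clip_eps default are just assumptions for the example:

import numpy as np

def clipped_surrogate(new_logp, old_logp, adv, clip_eps=0.2):
    # r(θ) = π(a|s) / π_old(a|s), computed from log-probabilities for numerical stability
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * adv                                         # r(θ) · Â
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv  # clip(r(θ), 1-e, 1+e) · Â
    # elementwise min of the two terms, averaged over the sampled timesteps
    return np.mean(np.minimum(unclipped, clipped))

You would take gradient ascent steps on this quantity with respect to the policy parameters (in practice, minimize its negative with a standard optimizer), where new_logp comes from the policy being optimized and old_logp is held fixed from the sampling policy.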

Great.

Next, let's see what effect the L^CLIP function creates. Here is a diagram from the paper that plots the value of the clipped objective for when the Advantage is positive and negative:

On the left half of the diagram, where (A > 0), this is where the action had an estimated positive effect on the outcome. On the right half of the diagram, where (A < 0), this is where the action had an estimated negative effect on the outcome.

Notice how on the left half, the r-value gets clipped if it gets too high. This will happen if the action became a lot more probable under the current policy than it was for the old policy. When this happens, we do not want to get greedy and step too far (because this is just a local approximation and sample of our policy, so it will not be accurate if we step too far), and so we clip the objective to prevent it from growing. (This will have the effect in the backward pass of blocking the gradient --- the flat line causing the gradient to be 0).
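To make that concrete with made-up numbers: with e = 0.2 and Â > 0, if the updates have pushed r(θ) up to 3.0, the clipped term is 1.2·Â, the min returns min(3.0·Â, 1.2·Â) = 1.2·Â, and that value no longer depends on θ, so this sample contributes zero gradient.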

On the right side of the diagram, where the action had an estimated negative effect on the outcome, we see that the clip activates near 0, where the action under the current policy is unlikely. This clipping region will similarly prevent us from updating too much to make the action much less probable after we already just took a big step to make it less probable.

So we see that both of these clipping regions prevent us from getting too greedy and trying to update too much at once and leaving the region where this sample offers a good estimate.

But why are we letting the r(θ) grow indefinitely on the far right side of the diagram? This seems odd at first, but what would cause r(θ) to grow really large in this case? r(θ) growth in this region will be caused by a gradient step that made our action a lot more probable, even though it turned out to make our policy worse. If that was the case, we would want to be able to undo that gradient step. And it just so happens that the L^CLIP function allows this. The function is negative here, so the gradient will tell us to walk the other direction and make the action less probable by an amount proportional to how much we screwed it up. (Note that there is a similar region on the far left side of the diagram, where the action is good and we accidentally made it less probable.)

These "undo" regions explain why we must include the weird minimization term in the objective function. They correspond to the unclipped r(θ)A having a lower value than the clipped version and getting returned by the minimization. This is because they were steps in the wrong direction (e.g., the action was good but we accidentally made it less probable). If we had not included the min in the objective function, these regions would be flat (gradient = 0) and we would be prevented from fixing mistakes.
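Continuing the made-up numbers from before (e = 0.2, Â > 0): if the updates have instead pushed r(θ) down to 0.5 even though the action was good, the clipped term is 0.8·Â but the min returns the unclipped 0.5·Â, which still depends on θ, so the gradient can push the action's probability back up.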

Here is a diagram summarizing this:

And that is the gist of it. The Clipped Surrogate Objective is just a drop-in replacement you could use in the vanilla policy gradient. The clipping limits the effective change you can make at each step in order to improve stability, and the minimization allows us to fix our mistakes in case we screwed it up. One thing I didn't discuss is what is meant by the PPO objective forming a "lower bound" as discussed in the paper. For more on that, I would suggest this part of a lecture the author gave.

2. Multiple Epochs for Policy Updating

Unlike vanilla policy gradient methods, and because of the Clipped Surrogate Objective function, PPO allows you to run multiple epochs of gradient ascent on your samples without causing destructively large policy updates. This allows you to squeeze more out of your data and reduce sample inefficiency.

PPO runs the policy using N parallel actors each collecting data, and then it samples mini-batches of this data to train for K epochs using the Clipped Surrogate Objective function. See full algorithm below (the approximate param values are: K = 3-15, M = 64-4096, T (horizon) = 128-2048):
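Paraphrasing Algorithm 1 from the paper, the loop looks roughly like this:

for iteration = 1, 2, ... do
    for actor = 1, 2, ..., N do
        Run policy π_old in the environment for T timesteps
        Compute advantage estimates Â_1, ..., Â_T
    end for
    Optimize the surrogate L^CLIP with respect to θ, with K epochs and minibatch size M ≤ N·T
    θ_old ← θ
end for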

The parallel actors part was popularized by the A3C paper and has become a fairly standard way of collecting data.

The newish part is that they are able to run K epochs of gradient ascent on the trajectory samples. As they state in the paper, it would be nice to run the vanilla policy gradient optimization for multiple passes over the data so that you could learn more from each sample. However, this generally fails in practice for vanilla methods because they take too big of steps on the local samples and this wrecks the policy. PPO, on the other hand, has the built-in mechanism to prevent too much of an update.

For each iteration, after sampling the environment with π_old (line 3) and when we start running the optimization (line 6), our policy π will be exactly equal to π_old. So at first, none of our updates will be clipped and we are guaranteed to learn something from these examples. However, as we update π using multiple epochs, the objective will start hitting the clipping limits, the gradient will go to 0 for those samples, and the training will gradually stop...until we move on to the next iteration and collect new samples.
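Concretely, at the start of each iteration r(θ) = π(a | s) / π_old(a | s) = 1 for every sample, which sits strictly inside (1 − e, 1 + e), so neither clip boundary is active and the first updates behave exactly like the unclipped surrogate.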

....

And that's all for now. If you are interested in gaining a better understanding, I would recommend digging more into the original paper, trying to implement it yourself, or diving into the baselines implementation and playing with the code.

[edit: 2019/01/27]: For a better background and for how PPO relates to other RL algorithms, I would also strongly recommend checking out OpenAI's Spinning Up resources and implementations.
