What is the difference between value iteration and policy iteration?


Problem description

In reinforcement learning, what is the difference between policy iteration and value iteration?

As far as I understand, in value iteration you use the Bellman equation to solve for the optimal policy, whereas in policy iteration you randomly select a policy π and find the reward of that policy.

My doubt is: if you are selecting a random policy π in PI, how is it guaranteed to be the optimal policy, even if we are choosing several random policies?

Answer

Let's look at them side by side. The key parts for comparison are highlighted. Figures are from Sutton and Barto's book: Reinforcement Learning: An Introduction.
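The figures themselves are not reproduced here; for reference, the two update rules they compare are, in the book's notation (with $p(s', r \mid s, a)$ the dynamics and $\gamma$ the discount factor):

Iterative policy evaluation (the inner loop of policy iteration), for a fixed policy $\pi$:

$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr]$$

Value iteration:

$$v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr]$$

The only structural change is the $\max$ over actions replacing the expectation under the current policy.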

Key points:

  1. Policy iteration includes: policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges (see the sketch after this list).
  2. Value iteration includes: finding the optimal value function + one policy extraction. There is no repetition of the two, because once the value function is optimal, the policy extracted from it should also be optimal (i.e. converged).
  3. Finding the optimal value function can also be seen as a combination of policy improvement (due to the max) and truncated policy evaluation (the reassignment of v(s) after just one sweep of all states, regardless of convergence).
  4. The algorithms for policy evaluation and for finding the optimal value function are highly similar, except for a max operation (as highlighted).
  5. Similarly, the key steps of policy improvement and policy extraction are identical, except that the former involves a stability check.
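
To make the points above concrete, here is a minimal Python sketch of both algorithms on a hypothetical two-state MDP. The MDP, constants, and function names are illustrative assumptions, not something taken from the book or the figures.

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions.
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 5.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, 2.0)]},
}
N_STATES, N_ACTIONS, GAMMA = 2, 2, 0.9


def q_value(V, s, a):
    """One-step lookahead: expected return of taking action a in state s."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def policy_iteration(theta=1e-8):
    """Policy evaluation + policy improvement, repeated until the policy is stable."""
    policy = np.zeros(N_STATES, dtype=int)   # start from an arbitrary policy
    V = np.zeros(N_STATES)
    while True:
        # Policy evaluation: sweep until V converges for the *current* policy.
        while True:
            delta = 0.0
            for s in range(N_STATES):
                v_new = q_value(V, s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to V.
        stable = True
        for s in range(N_STATES):
            best_a = max(range(N_ACTIONS), key=lambda a: q_value(V, s, a))
            if best_a != policy[s]:
                policy[s], stable = best_a, False
        if stable:                            # stability check ends the outer loop
            return policy, V


def value_iteration(theta=1e-8):
    """Truncated evaluation with a max over actions, then one policy extraction."""
    V = np.zeros(N_STATES)
    while True:
        delta = 0.0
        for s in range(N_STATES):
            v_new = max(q_value(V, s, a) for a in range(N_ACTIONS))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # One-shot policy extraction from the (near-)optimal value function.
    policy = np.array([max(range(N_ACTIONS), key=lambda a: q_value(V, s, a))
                       for s in range(N_STATES)])
    return policy, V


if __name__ == "__main__":
    print("policy iteration:", policy_iteration())
    print("value iteration: ", value_iteration())
```

Note how the structure mirrors the list: policy iteration stops when the greedy policy no longer changes (the stability check), while value iteration loops only over V and extracts the policy once at the end.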

In my experience, policy iteration is faster than value iteration, as a policy converges more quickly than a value function. I remember this is also described in the book.

I guess the confusion mainly came from all these somewhat similar terms, which also confused me before.

