What is the difference between value iteration and policy iteration?
Problem Description
In reinforcement learning, what is the difference between policy iteration and value iteration?
As far as I understand, in value iteration you use the Bellman equation to solve for the optimal policy, whereas in policy iteration you randomly select a policy π and find the reward of that policy.
My doubt is: if you are selecting a random policy π in PI, how is it guaranteed to end up being the optimal policy, even if we are choosing several random policies?
Recommended Answer
Let's look at them side by side. The key parts for comparison are highlighted. Figures are from Sutton and Barto's book: Reinforcement Learning: An Introduction.
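In the book's notation, the highlighted difference is that policy evaluation backs up an expectation under the current policy π, while value iteration backs up a max over actions:

$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr] \qquad \text{(policy evaluation)}$$

$$v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr] \qquad \text{(value iteration)}$$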
Key points:
- Policy iteration includes: policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges.
- Value iteration includes: finding the optimal value function + one policy extraction. There is no need to repeat the two, because once the value function is optimal, the policy extracted from it is also optimal (i.e. it has converged).
- Finding the optimal value function can also be seen as a combination of policy improvement (due to the max) and truncated policy evaluation (the reassignment of v(s) after just one sweep of all states, regardless of convergence).
- The algorithms for policy evaluation and for finding the optimal value function are highly similar except for a max operation (as highlighted above); compare the sketch after this list.
- Similarly, the key steps of policy improvement and policy extraction are identical, except that the former involves a stability check.
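To make the max-operation difference concrete, here is a minimal Python sketch of both algorithms on a tabular MDP. The dynamics format `P[s][a] = [(prob, next_state, reward), ...]` and the names `gamma` and `theta` are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation: sweep until v converges for a FIXED policy."""
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Expected one-step return under the current (deterministic) policy.
            a = policy[s]
            v_new = sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new
        if delta < theta:
            return v

def policy_iteration(P, n_actions, gamma=0.9):
    """Alternate full policy evaluation with greedy policy improvement."""
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)
    while True:
        v = policy_evaluation(P, policy, gamma)          # policy evaluation
        stable = True
        for s in range(n_states):                        # policy improvement
            q = [sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                stable = False
            policy[s] = best
        if stable:                                       # stability check
            return policy, v

def value_iteration(P, n_actions, gamma=0.9, theta=1e-8):
    """Fold the max into the sweep; extract the policy once at the end."""
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # The only difference from policy evaluation: max over actions.
            v_new = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                        for a in range(n_actions))
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new
        if delta < theta:
            break
    # One policy extraction, done exactly once after v has converged.
    policy = np.array([
        int(np.argmax([sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                       for a in range(n_actions)]))
        for s in range(n_states)])
    return policy, v
```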
In my experience, policy iteration is faster than value iteration, as a policy converges more quickly than a value function. I remember this is also described in the book.
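For what it's worth, both sketches above recover the same optimal policy on a toy MDP; the two-state dynamics below are made up purely for illustration.

```python
# Hypothetical 2-state, 2-action MDP: action 1 in state 0 jumps to the
# rewarding state 1; action 0 in state 1 stays there and keeps earning.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 1.0)], 1: [(1.0, 0, 0.0)]},
}
pi_policy, _ = policy_iteration(P, n_actions=2)
vi_policy, _ = value_iteration(P, n_actions=2)
assert (pi_policy == vi_policy).all()  # both converge to the same policy
```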
I guess the confusion mainly came from all these somewhat similar terms, which also confused me before.