Are Q-learning and SARSA with greedy selection equivalent?


Question

The difference between Q-learning and SARSA is that Q-learning compares the current state against the best possible next state, whereas SARSA compares the current state against the actual next state.

If a greedy selection policy is used, that is, the action with the highest action value is selected 100% of the time, are SARSA and Q-learning then identical?

Answer

Well, not actually. A key difference between SARSA and Q-learning is that SARSA is an on-policy algorithm (it follows the policy it is learning), while Q-learning is an off-policy algorithm (it can follow any policy that fulfills some convergence requirements).

Notice in the pseudocode of both algorithms (referenced below) that SARSA observes s', chooses a', and then updates the Q-function, while Q-learning first updates the Q-function; the next action to perform is then selected in the next iteration, derived from the updated Q-function, and is not necessarily equal to the a' selected to update Q.

In any case, both algorithms require exploration (i.e., taking actions different from the greedy action) to converge.
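As an illustration of what "taking actions different from the greedy action" can look like, here is a hedged ε-greedy selection sketch; the specific mechanism and the epsilon value are assumptions, since the answer only states that some exploration is required.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    # With probability epsilon pick a uniformly random action (exploration);
    # otherwise pick the action with the highest estimated value (greedy).
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))
```

Setting epsilon to 0 recovers the purely greedy selection asked about in the question; the answer's point is that some exploration (epsilon > 0, or another scheme) is needed for convergence.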

The pseudocode of SARSA and Q-learning referred to above is from Sutton and Barto's book: Reinforcement Learning: An Introduction (HTML version).
