What is the difference between Q-learning and SARSA?


Question

Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it's hard (to me) to see any difference between these two algorithms.

According to the book Reinforcement Learning: An Introduction (by Sutton and Barto), in the SARSA algorithm, given a policy, the corresponding action-value function Q (in state s for action a, at timestep t), i.e. Q(st, at), can be updated as follows

Q(st, at) = Q(st, at) + α*(rt + γ*Q(st+1, at+1) - Q(st, at))
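For a concrete picture, here is a minimal tabular sketch of that SARSA step; the `sarsa_update` name, the NumPy Q-table layout, and the `alpha`/`gamma` defaults are illustrative assumptions of mine, not something taken from the book.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA step: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a)).

    a_next is the action the behaviour policy actually takes in s_next,
    which is what makes SARSA on-policy.
    """
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```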

On the other hand, the update step for the Q-learning algorithm is the following

Q(st, at) = Q(st, at) + α*(rt + γ*maxa Q(st+1, a) - Q(st, at))

which can also be written as

Q(st, at) = (1 - α) * Q(st, at) + α * (rt + γ*maxa Q(st+1, a))

where γ (gamma) is the discount factor and rt is the reward received from the environment at timestep t.
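Under the same assumptions (a NumPy Q-table indexed by state and action, illustrative `alpha`/`gamma` defaults), a sketch of the Q-learning step looks like this; note that no next action is needed, because the bootstrap always uses the max:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    The bootstrap term uses the greedy (max) value in s_next, regardless of
    which action the behaviour policy will actually take next -- this is
    what makes Q-learning off-policy.
    """
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```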

Is the difference between these two algorithms the fact that SARSA only looks up the next policy value while Q-learning looks up the next maximum policy value?

TLDR (and my own answer)

Thanks to all those answering this question since I first asked it. I've made a github repo for playing with Q-Learning and have empirically understood what the difference is. It all amounts to how you select your next best action, which from an algorithmic standpoint can be a mean, max or best action depending on how you choose to implement it.

The other main difference is when this selection happens (e.g., online vs offline) and how/why that affects learning. If you are reading this in 2019 and are more of a hands-on person, playing with an RL toy problem is probably the best way to understand the differences.

One last important note is that both Sutton & Barto and Wikipedia often have mixed, confusing or wrong formulaic representations with regards to the next state's best/max action and reward:

r(t+1)

is actually

r(t)

Hope this helps anyone who ever gets stuck on this.

Answer

Yes, this is the only difference. On-policy SARSA learns action values relative to the policy it follows, while off-policy Q-Learning learns them relative to the greedy policy. Under some common conditions, they both converge to the true value function, but at different rates. Q-Learning tends to converge a little slower, but has the capability to continue learning while changing policies. Also, Q-Learning is not guaranteed to converge when combined with linear approximation.

In practical terms, under the ε-greedy policy, Q-Learning computes the difference between Q(s,a) and the maximum action value, while SARSA computes the difference between Q(s,a) and the weighted sum of the average action value and the maximum:

Q-Learning: Q(st+1, at+1) = maxa Q(st+1, a)

SARSA: Q(st+1, at+1) = ε · meana Q(st+1, a) + (1-ε) · maxa Q(st+1, a)
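To make that concrete, here is a small illustrative sketch (the `q_next` values and `epsilon` are made up) comparing the Q-learning bootstrap with the expected value of SARSA's sampled target under an ε-greedy behaviour policy:

```python
import numpy as np

epsilon = 0.1
q_next = np.array([1.0, 3.0, 2.0])  # hypothetical Q(s_{t+1}, ·) values

# Q-learning bootstrap: always the greedy (max) value
q_learning_target = np.max(q_next)  # 3.0

# Expectation of SARSA's sampled Q(s_{t+1}, a_{t+1}) under ε-greedy:
# with probability ε a uniformly random action, with probability (1-ε) the greedy one
expected_sarsa_target = epsilon * np.mean(q_next) + (1 - epsilon) * np.max(q_next)  # 0.1*2.0 + 0.9*3.0 = 2.9

print(q_learning_target, expected_sarsa_target)
```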

