What is the difference between Q-learning and SARSA?


Question

Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it's hard (to me) to see any difference between these two algorithms.

According to the book Reinforcement Learning: An Introduction (by Sutton and Barto), in the SARSA algorithm, given a policy, the corresponding action-value function Q (in state s for action a, at timestep t), i.e. Q(st, at), can be updated as follows

Q(st, at) = Q(st, at) + α*(rt + γ*Q(st+1, at+1) - Q(st, at))
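For a concrete picture, here is a minimal tabular sketch of that SARSA step; the `sarsa_update` name, the NumPy Q-table layout, and the `alpha`/`gamma` defaults are illustrative assumptions of mine, not something taken from the book.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA step: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a)).

    a_next is the action the behaviour policy actually takes in s_next,
    which is what makes SARSA on-policy.
    """
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```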

On the other hand, the update step for the Q-learning algorithm is the following

Q(st, at) = Q(st, at) + α*(rt + γ*maxa Q(st+1, a) - Q(st, at))

which can also be written as

Q(st, at) = (1 - α) * Q(st, at) + α * (rt + γ*maxa Q(st+1, a))

where γ (gamma) is the discount factor and rt is the reward received from the environment at timestep t.
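Under the same assumptions (a NumPy Q-table indexed by state and action, illustrative `alpha`/`gamma` defaults), a sketch of the Q-learning step looks like this; note that no next action is needed, because the bootstrap always uses the max:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    The bootstrap term uses the greedy (max) value in s_next, regardless of
    which action the behaviour policy will actually take next -- this is
    what makes Q-learning off-policy.
    """
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```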

Is the difference between these two algorithms the fact that SARSA only looks up the next policy value while Q-learning looks up the next maximum policy value?

TLDR (and my own answer)

Thanks to all those answering this question since I first asked it. I've made a github repo for playing with Q-Learning and have empirically understood what the difference is. It all amounts to how you select your next best action, which from an algorithmic standpoint can be a mean, max or best action depending on how you choose to implement it.

The other main difference is when this selection happens (e.g., online vs offline) and how/why that affects learning. If you are reading this in 2019 and are more of a hands-on person, playing with an RL toy problem is probably the best way to understand the differences.

One last important note is that both Sutton & Barto and Wikipedia often have mixed, confusing or wrong formulaic representations with regards to the next state's best/max action and reward:

r(t+1)

is actually

r(t)

Hope this helps anyone who ever gets stuck on this.

Answer

Yes, this is the only difference. On-policy SARSA learns action values relative to the policy it follows, while off-policy Q-Learning learns them relative to the greedy policy. Under some common conditions, they both converge to the true value function, but at different rates. Q-Learning tends to converge a little slower, but has the capability to continue learning while changing policies. Also, Q-Learning is not guaranteed to converge when combined with linear approximation.

In practical terms, under the ε-greedy policy, Q-Learning computes the difference between Q(s,a) and the maximum action value, while SARSA computes the difference between Q(s,a) and the weighted sum of the average action value and the maximum:

Q-Learning: Q(st+1, at+1) = maxa Q(st+1, a)

SARSA: Q(st+1, at+1) = ε · meana Q(st+1, a) + (1-ε) · maxa Q(st+1, a)
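To make that concrete, here is a small illustrative sketch (the `q_next` values and `epsilon` are made up) comparing the Q-learning bootstrap with the expected value of SARSA's sampled target under an ε-greedy behaviour policy:

```python
import numpy as np

epsilon = 0.1
q_next = np.array([1.0, 3.0, 2.0])  # hypothetical Q(s_{t+1}, ·) values

# Q-learning bootstrap: always the greedy (max) value
q_learning_target = np.max(q_next)  # 3.0

# Expectation of SARSA's sampled Q(s_{t+1}, a_{t+1}) under ε-greedy:
# with probability ε a uniformly random action, with probability (1-ε) the greedy one
expected_sarsa_target = epsilon * np.mean(q_next) + (1 - epsilon) * np.max(q_next)  # 0.1*2.0 + 0.9*3.0 = 2.9

print(q_learning_target, expected_sarsa_target)
```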

