What is the difference between Q-learning and Value Iteration?


Problem Description

How is Q-learning different from value iteration in reinforcement learning?

I know Q-learning is model-free and training samples are transitions (s, a, s', r). But since we know the transitions and the reward for every transition in Q-learning, is it not the same as model-based learning where we know the reward for a state and action pair, and the transitions for every action from a state (be it stochastic or deterministic)? I do not understand the difference.

Recommended Answer

You are 100% right that if we knew the transition probabilities and reward for every transition in Q-learning, it would be pretty unclear why we would use it instead of model-based learning or how it would even be fundamentally different. After all, transition probabilities and rewards are the two components of the model used in value iteration - if you have them, you have a model.
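To make the contrast concrete, here is a minimal value-iteration sketch. The layout of `P` and `R` below is just an illustrative assumption (not from the original answer): the point is that the planner must be handed the complete model before it can compute anything.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Model-based planning: the full model (P and R) must be known up front.

    Assumed layout (illustrative): P[s][a] is a list of (prob, next_state)
    pairs and R[s][a] is the expected immediate reward for taking a in s.
    """
    n_states, n_actions = len(P), len(P[0])
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: needs p(s'|s,a) and r(s,a) explicitly
        V_new = np.array([
            max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(n_actions))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Every backup above sums over p(s'|s,a), which is exactly the knowledge a Q-learning agent does not have.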

The key is that, in Q-learning, the agent does not know state transition probabilities or rewards. The agent only discovers that there is a reward for going from one state to another via a given action when it does so and receives a reward. Similarly, it only figures out what transitions are available from a given state by ending up in that state and looking at its options. If state transitions are stochastic, it learns the probability of transitioning between states by observing how frequently different transitions occur.
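By contrast, a tabular Q-learning agent only ever touches sampled transitions. Below is a minimal sketch, assuming a Gym-style environment where `reset()` returns a state, `step(a)` returns `(next_state, reward, done)`, and `n_actions` gives the number of actions; these names are illustrative assumptions, not part of the original answer.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Model-free: only observed (s, a, r, s') samples update Q.

    Q is a defaultdict(float) keyed by (state, action). No transition
    probabilities or reward function are ever consulted by the agent.
    """
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy choice from the current estimates
        if random.random() < epsilon:
            a = random.randrange(env.n_actions)
        else:
            a = max(range(env.n_actions), key=lambda x: Q[(s, x)])
        s_next, r, done = env.step(a)  # the agent's only source of knowledge
        best_next = 0.0 if done else max(Q[(s_next, x)] for x in range(env.n_actions))
        # TD update toward the observed reward plus the bootstrapped estimate
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
    return Q
```

Starting from `Q = defaultdict(float)` and running this episode function in a loop is all the agent ever does; whatever "model" it ends up with is implicit in the learned Q-table.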

A possible source of confusion here is that you, as the programmer, might know exactly how rewards and state transitions are set up. In fact, when you're first designing a system, odds are that you do as this is pretty important to debugging and verifying that your approach works. But you never tell the agent any of this - instead you force it to learn on its own through trial and error. This is important if you want to create an agent that is capable of entering a new situation that you don't have any prior knowledge about and figuring out what to do. Alternately, if you don't care about the agent's ability to learn on its own, Q-learning might also be necessary if the state-space is too large to repeatedly enumerate. Having the agent explore without any starting knowledge can be more computationally tractable.
