Q-Learning values get too high

Problem description
I've recently made an attempt to implement a basic Q-Learning algorithm in Golang. Note that I'm new to Reinforcement Learning and AI in general, so the error may very well be mine.
Here's how I implemented the solution to an m,n,k-game environment:

At each given time t, the agent holds the last state-action (s, a) and the reward acquired for it; the agent selects a move a' based on an Epsilon-greedy policy (a sketch of that selection step follows the notes below), calculates the reward r, and then proceeds to update the value of Q(s, a) for time t-1:
```go
func (agent *RLAgent) learn(reward float64) {
	var mState = marshallState(agent.prevState, agent.id)
	var oldVal = agent.values[mState]
	agent.values[mState] = oldVal + (agent.LearningRate *
		(agent.prevScore + (agent.DiscountFactor * reward) - oldVal))
}
```
Note:

- agent.prevState holds the previous state right after taking the action and before the environment responds (i.e. after the agent makes its move and before the other player makes a move). I use that in place of the state-action tuple, but I'm not quite sure if that's the right approach.
- agent.prevScore holds the reward for the previous state-action.
- The reward argument represents the reward for the current step's state-action (Qmax).
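For context, here is a minimal sketch of the epsilon-greedy selection step mentioned above; the moves slice and the qValue lookup are illustrative stand-ins, not the author's API:

```go
package qlearn

import (
	"math"
	"math/rand"
)

// selectMove implements epsilon-greedy selection: with probability epsilon
// it explores a random legal move, otherwise it exploits the move with the
// highest learned value.
func selectMove(moves []string, qValue func(move string) float64, epsilon float64) string {
	if rand.Float64() < epsilon {
		// Explore: play a uniformly random legal move.
		return moves[rand.Intn(len(moves))]
	}
	// Exploit: play the move whose resulting state-action has the
	// highest learned value.
	best, bestVal := moves[0], math.Inf(-1)
	for _, m := range moves {
		if v := qValue(m); v > bestVal {
			best, bestVal = m, v
		}
	}
	return best
}
```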
With agent.LearningRate = 0.2 and agent.DiscountFactor = 0.8, the agent fails to reach 100K episodes because of state-action value overflow. I'm using golang's float64 (standard IEEE 754-1985 double-precision floating point), which overflows at around ±1.80×10^308 and yields ±Infinity. That's too big a value, I'd say!

Here's the state of a model trained with a learning rate of 0.02 and a discount factor of 0.08, which got through 2M episodes (1M games with itself):

```
Reinforcement learning model report
Iterations: 2000000
Learned states: 4973
Maximum value: 88781786878142287058992045692178302709335321375413536179603017129368394119653322992958428880260210391115335655910912645569618040471973513955473468092393367618971462560382976.000000
Minimum value: 0.000000
```
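A side note on spotting this kind of divergence early (my observation, not part of the question): with rewards bounded in [-1, 1] and discount factor γ, tabular Q-values should stay on the order of r_max / (1 − γ), which is 5 for γ = 0.8. A sketch of a guard built on that bound, with illustrative names:

```go
package qlearn

import "math"

// qUpperBound returns rMax / (1 - gamma), roughly the largest magnitude
// a correct tabular Q-value can converge to when every reward lies in
// [-rMax, rMax].
func qUpperBound(rMax, gamma float64) float64 {
	return rMax / (1 - gamma)
}

// diverged reports whether a learned value sits far outside that bound,
// which signals a broken update rule rather than legitimate learning.
func diverged(q, rMax, gamma float64) bool {
	return math.Abs(q) > 10*qUpperBound(rMax, gamma)
}
```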
The reward function returns:
- Agent won: 1
- Agent lost: -1
- Draw: 0
- Game continues: 0.5
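One way to encode that reward mapping in Go (the Outcome type and its constants are my own illustrative names, not the author's):

```go
package qlearn

// Outcome is the result of the last move from the agent's perspective.
type Outcome int

const (
	Continues Outcome = iota
	Won
	Lost
	Draw
)

// rewardFor maps a game outcome to the scalar reward listed above.
func rewardFor(o Outcome) float64 {
	switch o {
	case Won:
		return 1
	case Lost:
		return -1
	case Draw:
		return 0
	default: // game continues
		return 0.5
	}
}
```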
But as you can see in the report above, the minimum value is zero while the maximum value is absurdly high.
It may be worth mentioning that a simpler learning method I found in a Python script works perfectly fine and actually feels more intelligent! When I play against it, the result is a draw most of the time (it even wins if I play carelessly), whereas with the standard Q-Learning method I can't even let it win! That method's update is simply:

```go
agent.values[mState] = oldVal + (agent.LearningRate * (reward - agent.prevScore))
```
Any ideas on how to fix this? Is that kind of state-action value normal in Q-Learning?!
Update:

After reading Pablo's answer and the slight but important edit that Nick provided to this question, I realized the problem was prevScore containing the Q-value of the previous step (equal to oldVal) instead of the reward of the previous step (in this example: -1, 0, 0.5, or 1).

After that change the agent behaves normally, and after 2M episodes the state of the model is as follows:
```
Reinforcement learning model report
Iterations: 2000000
Learned states: 5477
Maximum value: 1.090465
Minimum value: -0.554718
```
Out of 5 games against the agent, there were 2 wins for me (the agent did not recognize that I had two stones in a row) and 3 draws.
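To make the fix concrete, here is the same learn function with comments marking what each value must now contain; the code structure itself is unchanged from the question:

```go
// The fix is in the bookkeeping, not the formula:
//   agent.prevScore - the raw reward observed for the previous
//                     state-action (-1, 0, 0.5 or 1), NOT its old Q-value
//   reward          - the max Q-value reachable from the current state (Qmax)
// which turns the update below into the standard rule
//   Q(s,a) <- Q(s,a) + alpha * (r + gamma*Qmax - Q(s,a))
func (agent *RLAgent) learn(reward float64) {
	var mState = marshallState(agent.prevState, agent.id)
	var oldVal = agent.values[mState]
	agent.values[mState] = oldVal + (agent.LearningRate *
		(agent.prevScore + (agent.DiscountFactor * reward) - oldVal))
}
```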
Solution

If I've understood correctly, in your Q-learning update rule you are using the current reward and the previous reward. However, the Q-learning rule uses only one reward (x are states and u are actions):

Q(x_t, u_t) ← Q(x_t, u_t) + α · (r_{t+1} + γ · max_u Q(x_{t+1}, u) − Q(x_t, u_t))

On the other hand, you are assuming that the current reward is the same as the Qmax value, which is not true. So probably you are misunderstanding the Q-learning algorithm.
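For reference, a minimal, self-contained Go sketch of the tabular rule above; every name in it (Key, Table, Update) is illustrative rather than taken from the question:

```go
package qlearn

// Key identifies a state-action pair (x, u).
type Key struct{ State, Action string }

// Table holds tabular Q-values plus the learning parameters.
type Table struct {
	Q     map[Key]float64
	Alpha float64 // learning rate
	Gamma float64 // discount factor
}

// Update applies one step of
//   Q(x,u) <- Q(x,u) + alpha * (r + gamma * max_u' Q(x',u') - Q(x,u))
// for the transition (x, u) -> xNext with a single reward r.
// nextActions are the actions available in xNext; if xNext is terminal
// (no actions), the max term is taken to be zero.
func (t *Table) Update(x, u string, r float64, xNext string, nextActions []string) {
	best := 0.0
	for i, a := range nextActions {
		if v := t.Q[Key{xNext, a}]; i == 0 || v > best {
			best = v
		}
	}
	k := Key{x, u}
	t.Q[k] += t.Alpha * (r + t.Gamma*best - t.Q[k])
}
```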