Q-Learning values get too high

Problem description
I've recently made an attempt to implement a basic Q-Learning algorithm in Golang. Note that I'm new to Reinforcement Learning and AI in general, so the error may very well be mine.
Here's how I implemented the solution to an m,n,k-game environment:

At each given time t, the agent holds the last state-action (s, a) and the reward acquired for it; the agent selects a move a' based on an Epsilon-greedy policy (a sketch of that selection step follows the notes below), calculates the reward r, and then proceeds to update the value of Q(s, a) for time t-1:
```go
func (agent *RLAgent) learn(reward float64) {
	var mState = marshallState(agent.prevState, agent.id)
	var oldVal = agent.values[mState]
	agent.values[mState] = oldVal + (agent.LearningRate *
		(agent.prevScore + (agent.DiscountFactor * reward) - oldVal))
}
```
Note:

- agent.prevState holds the previous state right after taking the action and before the environment responds (i.e. after the agent makes its move and before the other player makes a move). I use that in place of the state-action tuple, but I'm not quite sure if that's the right approach.
- agent.prevScore holds the reward for the previous state-action.
- The reward argument represents the reward for the current step's state-action (Qmax).
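For context, here is a minimal sketch of the epsilon-greedy selection step mentioned above; the moves slice and the qValue lookup are illustrative stand-ins, not the author's API:

```go
package qlearn

import (
	"math"
	"math/rand"
)

// selectMove implements epsilon-greedy selection: with probability epsilon
// it explores a random legal move, otherwise it exploits the move with the
// highest learned value.
func selectMove(moves []string, qValue func(move string) float64, epsilon float64) string {
	if rand.Float64() < epsilon {
		// Explore: play a uniformly random legal move.
		return moves[rand.Intn(len(moves))]
	}
	// Exploit: play the move whose resulting state-action has the
	// highest learned value.
	best, bestVal := moves[0], math.Inf(-1)
	for _, m := range moves {
		if v := qValue(m); v > bestVal {
			best, bestVal = m, v
		}
	}
	return best
}
```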
With agent.LearningRate = 0.2 and agent.DiscountFactor = 0.8, the agent fails to reach 100K episodes because of state-action value overflow. I'm using golang's float64 (standard IEEE 754-1985 double-precision floating point), which overflows at around ±1.80×10^308 and yields ±Infinity. That's too big a value, I'd say!

Here's the state of a model trained with a learning rate of 0.02 and a discount factor of 0.08, which got through 2M episodes (1M games with itself):

```
Reinforcement learning model report
Iterations: 2000000
Learned states: 4973
Maximum value: 88781786878142287058992045692178302709335321375413536179603017129368394119653322992958428880260210391115335655910912645569618040471973513955473468092393367618971462560382976.000000
Minimum value: 0.000000
```
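A side note on spotting this kind of divergence early (my observation, not part of the question): with rewards bounded in [-1, 1] and discount factor γ, tabular Q-values should stay on the order of r_max / (1 − γ), which is 5 for γ = 0.8. A sketch of a guard built on that bound, with illustrative names:

```go
package qlearn

import "math"

// qUpperBound returns rMax / (1 - gamma), roughly the largest magnitude
// a correct tabular Q-value can converge to when every reward lies in
// [-rMax, rMax].
func qUpperBound(rMax, gamma float64) float64 {
	return rMax / (1 - gamma)
}

// diverged reports whether a learned value sits far outside that bound,
// which signals a broken update rule rather than legitimate learning.
func diverged(q, rMax, gamma float64) bool {
	return math.Abs(q) > 10*qUpperBound(rMax, gamma)
}
```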
The reward function returns:
- Agent won: 1
- Agent lost: -1
- Draw: 0
- Game continues: 0.5
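One way to encode that reward mapping in Go (the Outcome type and its constants are my own illustrative names, not the author's):

```go
package qlearn

// Outcome is the result of the last move from the agent's perspective.
type Outcome int

const (
	Continues Outcome = iota
	Won
	Lost
	Draw
)

// rewardFor maps a game outcome to the scalar reward listed above.
func rewardFor(o Outcome) float64 {
	switch o {
	case Won:
		return 1
	case Lost:
		return -1
	case Draw:
		return 0
	default: // game continues
		return 0.5
	}
}
```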
But as you can see in the report above, the minimum value is zero while the maximum value is absurdly high.
It may be worth mentioning that a simpler learning method I found in a Python script works perfectly fine and actually feels more intelligent! When I play against it, the result is a draw most of the time (it even wins if I play carelessly), whereas with the standard Q-Learning method I can't even let it win! That method's update is simply:

```go
agent.values[mState] = oldVal + (agent.LearningRate * (reward - agent.prevScore))
```
Any ideas on how to fix this? Is that kind of state-action value normal in Q-Learning?!
Update:

After reading Pablo's answer and the slight but important edit that Nick provided to this question, I realized the problem was prevScore containing the Q-value of the previous step (equal to oldVal) instead of the reward of the previous step (in this example: -1, 0, 0.5, or 1).

After that change the agent behaves normally, and after 2M episodes the state of the model is as follows:
```
Reinforcement learning model report
Iterations: 2000000
Learned states: 5477
Maximum value: 1.090465
Minimum value: -0.554718
```
Out of 5 games against the agent, there were 2 wins for me (the agent did not recognize that I had two stones in a row) and 3 draws.
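To make the fix concrete, here is the same learn function with comments marking what each value must now contain; the code structure itself is unchanged from the question:

```go
// The fix is in the bookkeeping, not the formula:
//   agent.prevScore - the raw reward observed for the previous
//                     state-action (-1, 0, 0.5 or 1), NOT its old Q-value
//   reward          - the max Q-value reachable from the current state (Qmax)
// which turns the update below into the standard rule
//   Q(s,a) <- Q(s,a) + alpha * (r + gamma*Qmax - Q(s,a))
func (agent *RLAgent) learn(reward float64) {
	var mState = marshallState(agent.prevState, agent.id)
	var oldVal = agent.values[mState]
	agent.values[mState] = oldVal + (agent.LearningRate *
		(agent.prevScore + (agent.DiscountFactor * reward) - oldVal))
}
```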
Solution

If I've understood correctly, in your Q-learning update rule you are using the current reward and the previous reward. However, the Q-learning rule uses only one reward (x are states and u are actions):

Q(x_t, u_t) ← Q(x_t, u_t) + α · (r_{t+1} + γ · max_u Q(x_{t+1}, u) − Q(x_t, u_t))

On the other hand, you are assuming that the current reward is the same as the Qmax value, which is not true. So probably you are misunderstanding the Q-learning algorithm.
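For reference, a minimal, self-contained Go sketch of the tabular rule above; every name in it (Key, Table, Update) is illustrative rather than taken from the question:

```go
package qlearn

// Key identifies a state-action pair (x, u).
type Key struct{ State, Action string }

// Table holds tabular Q-values plus the learning parameters.
type Table struct {
	Q     map[Key]float64
	Alpha float64 // learning rate
	Gamma float64 // discount factor
}

// Update applies one step of
//   Q(x,u) <- Q(x,u) + alpha * (r + gamma * max_u' Q(x',u') - Q(x,u))
// for the transition (x, u) -> xNext with a single reward r.
// nextActions are the actions available in xNext; if xNext is terminal
// (no actions), the max term is taken to be zero.
func (t *Table) Update(x, u string, r float64, xNext string, nextActions []string) {
	best := 0.0
	for i, a := range nextActions {
		if v := t.Q[Key{xNext, a}]; i == 0 || v > best {
			best = v
		}
	}
	k := Key{x, u}
	t.Q[k] += t.Alpha * (r + t.Gamma*best - t.Q[k])
}
```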