Unbounded increase in Q-Value, consequence of recurrent reward after repeating the same action in Q-Learning


Question

I'm in the process of developing a simple Q-Learning implementation for a trivial application, but there's something that keeps puzzling me.

Let's consider the standard formulation of Q-Learning

Q(S, A) = Q(S, A) + alpha * [R + gamma * MaxQ(S', A') - Q(S, A)]
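
In code, that update could look like the following minimal sketch (the function name, the dict representation of Q and the default values are illustrative, not taken from the question; gamma is the discount factor):

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)  # max over actions available in s'
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)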

Let's assume there is a state K with two possible actions, A and A', awarding our agent rewards R and R' respectively.

If we follow an almost-totally-greedy approach (say, epsilon = 0.1), at first I'll randomly choose one of the actions, for instance A. The next time, I'll probably (90% of the time) choose A again, which causes Q(K, A) to keep growing and growing. Even if by chance I try A', its reward is probably of the same magnitude as that of A, so we end up in a situation where it's practically impossible to "recover" from our first guess during the rest of the learning.

I guess this must not be so, otherwise the agent would basically not learn -- it would just follow a simple recipe: do everything exactly as you did the first time.

Am I missing something? I know I can tweak the alpha value (generally, decreasing it over time), but that in no way improves our situation.

Answer

From the convergence result for Q-learning, we know:

Convergence of Q-learning holds using any exploration policy, and it only requires that each state-action pair (s, a) is executed infinitely often.

The epsilon-greedy policy is a balance between exploration and exploitation, which guarantees convergence and, often, good performance. But in practical problems, we often need some heuristics to adjust the learning rate alpha in order to obtain a better return; otherwise, the "infinitely often" requirement is hard to meet.
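
For concreteness, here is a minimal sketch of epsilon-greedy action selection (the function name and the random tie-breaking are my own additions; ties are broken randomly so that equal Q-values don't lock in a single action):

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise pick a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    best = max(Q.get((s, a), 0.0) for a in actions)
    return random.choice([a for a in actions if Q.get((s, a), 0.0) == best])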

I list an example below. This is a classical problem in which you have a grid, and each cell may hold a different reward. For instance, a 4x4 grid is shown below, in which every cell contains a reward of 1 except the top-left cell, which contains a bigger reward of 10. A robot moves around the grid. The legal actions are moving LEFT, RIGHT, UP and DOWN, but the robot cannot move off the grid.

So our state space contains 16 distinct states, corresponding to the 16 cells. Each state has a different number of legal actions because of the border constraints. Our goal is to compute the optimal policy (given any state s, output an optimal action a).

+++++++++++++++++++++
+ 10 +  1 +  1 + 1  +
+++++++++++++++++++++
+ 1  +  1 +  1 + 1  +
+++++++++++++++++++++
+ 1  +  1 +  1 + 1  +
+++++++++++++++++++++
+ 1  +  1 +  1 + 1  +
+++++++++++++++++++++

Suppose we use an epsilon-greedy policy with epsilon=0.1 and a constant learning rate alpha=0.1. We start at a random position on the grid. Whenever we reach the top-left corner, we restart at a random position again.
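
A rough sketch of how such a simulation could be set up is shown below (the grid layout, rewards and restart rule follow the description above; the value of gamma and the tie-breaking details are assumptions, since the answer does not state them):

import random

N = 4                                         # 4x4 grid; reward of 10 at (0, 0), 1 elsewhere
ACTIONS = {'UP': (-1, 0), 'DOWN': (1, 0), 'LEFT': (0, -1), 'RIGHT': (0, 1)}

def legal_actions(s):
    """Actions that keep the robot inside the grid."""
    r, c = s
    return [a for a, (dr, dc) in ACTIONS.items()
            if 0 <= r + dr < N and 0 <= c + dc < N]

def step(s, a):
    """Move and collect the reward of the cell we land on."""
    dr, dc = ACTIONS[a]
    s_next = (s[0] + dr, s[1] + dc)
    reward = 10 if s_next == (0, 0) else 1
    return s_next, reward

Q, visits = {}, {}                            # Q: (s, a) -> value; visits: (s, a) -> count
epsilon, alpha, gamma = 0.1, 0.1, 0.9         # gamma is an assumption; the answer does not state it
s = (random.randrange(N), random.randrange(N))

for _ in range(200_000):
    acts = legal_actions(s)
    if random.random() < epsilon:             # explore
        a = random.choice(acts)
    else:                                     # exploit the current greedy action
        a = max(acts, key=lambda x: Q.get((s, x), 0.0))
    s_next, r = step(s, a)
    visits[(s, a)] = visits.get((s, a), 0) + 1
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in legal_actions(s_next))
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    # Restart from a random cell whenever the big-reward corner is reached.
    s = (random.randrange(N), random.randrange(N)) if s_next == (0, 0) else s_next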

Below is the result of running a simulation of 200,000 moves. The left-most block visually shows the current greedy policy at each cell:

  • --> Moving right
  • <-- Moving left
  • ^ Moving up
  • v Moving down

So you see this is far from an optimal policy. Obviously, in an optimal policy every cell should point either left or up, because we have a significantly bigger reward at position (0,0).

 v   v   v   v   |      2      9      5      4   
 v   v   v   v   |     14     98     75     14   
-->  v   v  <--  |    258   3430   3312    245  
--> --> <-- <--  |   3270  93143  92978   3191  

The right block shows how many times we have visited each cell so far. You can see that we spend most of the visits at the bottom but visit the top row very rarely. This is why we have not reached the optimal policy yet.

If we change the learning rate to alpha=1/(number of times you visited (s,a) so far), we are able to reach the optimal policy (shown below) within 20,000 steps. Also, the number of times we visit each cell is distributed more uniformly, though still not perfectly.

 --> <-- <-- <--  |     34   7997   7697    294 
  ^   ^   ^  <--  |    731    898    524    132 
  ^   ^   ^   ^   |    709    176     88     94 
  ^   ^   ^   ^   |    245    256     96     77  
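
The only change relative to the constant-alpha loop sketched earlier is the learning rate. Reusing the visits counter and variables from that sketch, the update could look like this (again just an illustration of the idea):

# Inside the training loop, replace the constant alpha with a count-based rate:
visits[(s, a)] = visits.get((s, a), 0) + 1    # N(s, a): times this pair has been updated
alpha = 1.0 / visits[(s, a)]                  # alpha = 1 / N(s, a), decays per state-action pair
old = Q.get((s, a), 0.0)
best_next = max(Q.get((s_next, a2), 0.0) for a2 in legal_actions(s_next))
Q[(s, a)] = old + alpha * (r + gamma * best_next - old)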

For a larger problem with more states, e.g. a 10x10 grid, I find it's better to use a larger epsilon. For example, below is the result of a simulation after 80,000 moves on a 10x10 grid with epsilon=0.5. It's almost optimal except for the bottom-right corner. There is also the idea of using simulated annealing to help improve the convergence rate of Q-learning.

 v  <-- <-- <-- <-- <-- <-- <-- <-- <--  |     19   2500   1464    716    386    274    216    159    121     71 
 ^  <-- <-- <-- <--  v  <-- <-- <-- <--  |   9617  11914   3665   1071    580    410    319    225    207    131 
 ^   ^   ^  <-- <-- <-- <--  v  <-- <--  |   5355   5716   2662   1675   1465    611    302    183    162    101 
 ^   ^   ^   ^   ^  <-- <-- <-- <-- <--  |   1604   1887   1192    621   1056    882    693    403    206    100 
 ^   ^   ^   ^   ^   ^   ^  <-- <-- <--  |    639    735    731    333    412    399    480    294    172    114 
 ^   ^   ^  <--  ^   ^   ^  <-- <--  ^   |    373    496    640    454    272    266    415    219    107     98 
 ^   ^   ^   ^   ^   ^   ^   ^  <--  ^   |    251    311    402    428    214    161    343    176    114     99 
 ^   ^   ^   ^  <-- -->  ^  <-- <-- <--  |    186    185    271    420    365    209    359    200    113     70 
 ^   ^   ^   ^   ^   ^   ^   ^   v   v   |    129    204    324    426    434    282    235    131     99     74 
 ^   ^   ^   ^   ^  <--  ^  <-- <-- <--  |    100    356   1020   1233    703    396    301    216    152     78 
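
One simple way to realize such an annealing idea is to decay epsilon over time, starting large and shrinking it as learning progresses. The schedule below is only an illustration of that idea, not the method used in the answer:

def epsilon_at(t, eps_start=0.5, eps_end=0.05, decay_steps=80_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps moves."""
    frac = min(t / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)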

BTW, my Python code (~100 lines) for the toy problem is here.
