How to get out of 'sticky' states?

Problem description

Problem:

I've trained an agent to perform a simple task in a grid world (go to the top of the grid without hitting obstacles), but the following situation always seems to occur. It finds itself in an easy part of the state space (no obstacles), and so continually gets a strong positive reinforcement signal. Then, when it does find itself in a difficult part of the state space (wedged next to two obstacles), it simply chooses the same action as before, to no effect (it goes up and hits the obstacle). Eventually the Q value for that action matches the negative reward, but by this time the other actions have even lower Q values from being useless in the easy part of the state space, so the error signal drops to zero and the incorrect action is still always chosen.

How can I prevent this from happening? I've thought of a few solutions, but none seem viable:

  • Use a policy that is always exploration-heavy. As the obstacles take ~5 actions to get around, a single random action every now and then seems ineffective.
  • Make the reward function such that bad actions are worse when they are repeated. This makes the reward function break the Markov property. Maybe this isn't a bad thing, but I simply don't have a clue.
  • Only reward the agent for completing the task. The task takes over a thousand actions to complete, so the training signal would be way too weak.

Some background on the task:

So I've made a little testbed for trying out RL algorithms -- something like a more complex version of the grid-world described in the Sutton book. The world is a large binary grid (300 by 1000) populated by 1's in the form of randomly sized rectangles on a backdrop of 0's. A band of 1's surrounds the edges of the world.

An agent occupies a single space in this world and observes only a fixed window around it (a 41 by 41 window with the agent in the center). The agent's actions consist of moving by 1 space in any of the four cardinal directions. The agent can only move through spaces marked with a 0; 1's are impassable.
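
To make the observation setup concrete, here is a minimal sketch of how such a fixed window might be cut out of the grid. The function name and the padding at the borders are my own assumptions (the band of 1's around the world already prevents out-of-bounds reads in the original setup, but padding keeps the sketch self-contained):

```python
import numpy as np

def observation(grid: np.ndarray, pos: tuple[int, int], radius: int = 20) -> np.ndarray:
    """Return the square window of side 2*radius + 1 centered on the agent
    (41 by 41 for radius=20). Cells outside the world are treated as 1's."""
    padded = np.pad(grid, radius, mode="constant", constant_values=1)
    y, x = pos
    return padded[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
```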

The current task to be performed in this environment is to make it to the top of the grid world starting from a random position along the bottom. A reward of +1 is given for successfully moving upwards. A reward of -1 is given for any move that would hit an obstacle or the edge of the world. All other states receive a reward of 0.
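
As a concrete restatement of that reward rule, here is a minimal sketch; the grid encoding (a NumPy array with 1 = obstacle) and the function name are illustrative assumptions, not part of the original testbed:

```python
import numpy as np

UP = (-1, 0)  # assumed convention: "up" means moving toward row 0

def step_reward(grid: np.ndarray, pos: tuple[int, int], move: tuple[int, int]) -> float:
    """+1 for a successful upward move, -1 for any move that would hit an
    obstacle or the edge of the world, 0 otherwise."""
    y, x = pos
    ny, nx = y + move[0], x + move[1]
    off_world = not (0 <= ny < grid.shape[0] and 0 <= nx < grid.shape[1])
    if off_world or grid[ny, nx] == 1:
        return -1.0
    return 1.0 if move == UP else 0.0
```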

The agent uses the basic SARSA algorithm with a neural net value function approximator (as discussed in the Sutton book). For policy decisions I've tried both e-greedy and softmax.
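
For reference, a minimal sketch of the two action-selection rules mentioned, operating on a vector of Q values for the current state; the epsilon and temperature values are placeholders, not the ones actually used:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values: np.ndarray, epsilon: float = 0.1) -> int:
    """With probability epsilon take a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_action(q_values: np.ndarray, temperature: float = 1.0) -> int:
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = q_values / temperature
    prefs = prefs - prefs.max()              # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))
```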

Answer

The typical way of teaching such tasks is to give the agent a negative reward each step and then a big payout on completion. You can compensate for the long delay by using eligibility traces and by placing the agent close to the goal initially, and then close to the area it has explored.
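
To illustrate the eligibility-trace part of this suggestion, here is a minimal tabular SARSA(λ) episode; the asker's setup uses a neural-net approximator, so this is only a sketch of the mechanism, and the `env`/`policy` interfaces and the hyperparameter values are hypothetical:

```python
from collections import defaultdict

def sarsa_lambda_episode(env, policy, Q=None, alpha=0.1, gamma=0.99, lam=0.9):
    """Run one episode of tabular SARSA(lambda).

    `env` is assumed to expose reset() -> state and step(action) -> (state, reward, done);
    `policy` is assumed to pick an action given (Q, state), e.g. epsilon-greedy.
    """
    Q = defaultdict(float) if Q is None else Q   # Q[(state, action)]
    E = defaultdict(float)                       # eligibility trace per (state, action)
    state = env.reset()
    action = policy(Q, state)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = policy(Q, next_state)
        # TD error; the bootstrap term is dropped on terminal transitions.
        target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
        td_error = target - Q[(state, action)]
        E[(state, action)] += 1.0                # accumulating trace
        # Credit every recently visited pair, decayed by gamma * lambda.
        for key in list(E):
            Q[key] += alpha * td_error * E[key]
            E[key] *= gamma * lam
        state, action = next_state, next_action
    return Q
```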
