How does DQN work in an environment where reward is always -1

Question

Given that the OpenAI Gym environment MountainCar-v0 ALWAYS returns -1.0 as the reward (even when the goal is achieved), I don't understand how DQN with experience replay converges, yet I know it does, because I have working code that proves it. By working, I mean that when I train the agent, it quickly (within 300-500 episodes) learns how to solve the MountainCar problem. Below is an example from my trained agent.

It is my understanding that ultimately there needs to be a "sparse reward" that is found. Yet as far as I can see from the OpenAI Gym code, there is never any reward other than -1. It feels more like a "no reward" environment.

What almost answers my question, but in fact does not: when the task is completed quickly, the return (sum of rewards) of the episode is larger. So if the car never finds the flag, the return is -1000; if the car finds the flag quickly, the return might be -200. The reason this does not answer my question is that with DQN and experience replay, those returns (-1000, -200) are never present in the experience replay memory. All the memory holds are tuples of the form (state, action, reward, next_state), and of course those tuples are sampled from memory at random, not episode by episode.
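For concreteness, here is a minimal sketch of the kind of replay memory described above (the class name and capacity are illustrative, not taken from the question's code; most DQN implementations also store a done flag alongside the tuple, which becomes relevant in the answer below). Transitions are stored one step at a time and sampled uniformly at random, so no episode-level return ever appears in the buffer:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores individual (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        # Each entry is a single step; the episode's total return is never stored.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling: transitions from different episodes get mixed together.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```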

Another element of this particular OpenAI Gym environment is that the Done state is returned on either of two occasions: hitting the flag (yay) or timing out after some number of steps (boo). However, the agent treats both the same, accepting the reward of -1. Thus, as far as the tuples in memory are concerned, both events look identical from a reward standpoint.
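To see this concretely, here is a minimal collection loop (assuming the classic gym step API that returns a single done flag; newer gymnasium versions split it into terminated/truncated). Whether the episode ends at the flag or at the environment's step limit, the stored transition carries the same reward of -1; the only hint about the cause is the TimeLimit.truncated key that gym's TimeLimit wrapper may add to info, which a plain DQN agent never looks at:

```python
import gym

env = gym.make("MountainCar-v0")
memory = []  # stand-in for the replay buffer above

state = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random policy, just for illustration
    next_state, reward, done, info = env.step(action)
    # reward is -1.0 on every step, including the final one,
    # whether the flag was reached or the time limit expired.
    memory.append((state, action, reward, next_state, done))
    state = next_state

# The cause of termination is not visible in the stored tuple itself.
print("last reward:", memory[-1][2])                        # always -1.0
print("timed out?", info.get("TimeLimit.truncated", False))
```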

So, I don't see anything in the memory that indicates that the episode was performed well.

And thus, I have no idea why this DQN code is working for MountainCar.

Answer

The reason this works is that in Q-learning, your model is trying to estimate the SUM (technically the time-decayed sum) of all future rewards for each possible action. In MountainCar you get a reward of -1 every step until you win, so if you do manage to win, you end up with a less negative return than usual. For example, your total score after winning might be -160 instead of -200, so your model will start predicting higher Q-values for the actions that have historically led to winning the game.
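In code, this shows up in how the training target is built from each sampled transition. A minimal sketch of the standard DQN target (variable names and the 0.99 discount are illustrative): every transition contributes a reward of -1, but transitions whose next state is terminal stop the bootstrapping, so states from which the flag is reached in few steps end up with less negative Q-values, and those values propagate backwards through the bootstrap term.

```python
import numpy as np

gamma = 0.99  # discount factor (illustrative value)

def dqn_targets(rewards, next_q_values, dones):
    """Compute y = r + gamma * max_a' Q(s', a') for a sampled batch.

    rewards       : shape (batch,)            -- here always -1.0
    next_q_values : shape (batch, n_actions)  -- Q-network output for next_state
    dones         : shape (batch,)            -- 1.0 if next_state ended the episode
    """
    bootstrap = np.max(next_q_values, axis=1)           # estimated value of the best next action
    return rewards + gamma * (1.0 - dones) * bootstrap  # no bootstrapping past a terminal state

# A Q-value therefore approximates -(discounted number of steps left until termination):
# a state-action pair two steps from the flag converges toward -1 + gamma * (-1) ≈ -1.99,
# while one far from the flag accumulates many more discounted -1 terms.
```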
