How does DQN work in an environment where reward is always -1

Question

Given that the OpenAI Gym environment MountainCar-v0 ALWAYS returns -1.0 as the reward (even when the goal is achieved), I don't understand how DQN with experience replay converges, yet I know it does, because I have working code that proves it. By working, I mean that when I train the agent, it quickly (within 300-500 episodes) learns how to solve the MountainCar problem. Below is an example from my trained agent.

It is my understanding that ultimately there needs to be a "sparse reward" that is found. Yet as far as I can see from the OpenAI Gym code, there is never any reward other than -1. It feels more like a "no reward" environment.

What almost answers my question, but in fact does not: when the task is completed quickly, the return (sum of rewards) of the episode is larger. So if the car never finds the flag, the return is -1000; if the car finds the flag quickly, the return might be -200. The reason this does not answer my question is that with DQN and experience replay, those returns (-1000, -200) are never present in the experience replay memory. All the memory holds are tuples of the form (state, action, reward, next_state), and of course those tuples are sampled from memory at random, not episode by episode.
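For concreteness, here is a minimal sketch of the kind of replay memory described above (the class name and capacity are illustrative, not taken from the question's code; most DQN implementations also store a done flag alongside the tuple, which becomes relevant in the answer below). Transitions are stored one step at a time and sampled uniformly at random, so no episode-level return ever appears in the buffer:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores individual (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        # Each entry is a single step; the episode's total return is never stored.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling: transitions from different episodes get mixed together.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```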

Another element of this particular OpenAI Gym environment is that the Done state is returned on either of two occasions: hitting the flag (yay) or timing out after some number of steps (boo). However, the agent treats both the same, accepting the reward of -1. Thus, as far as the tuples in memory are concerned, both events look identical from a reward standpoint.
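To see this concretely, here is a minimal collection loop (assuming the classic gym step API that returns a single done flag; newer gymnasium versions split it into terminated/truncated). Whether the episode ends at the flag or at the environment's step limit, the stored transition carries the same reward of -1; the only hint about the cause is the TimeLimit.truncated key that gym's TimeLimit wrapper may add to info, which a plain DQN agent never looks at:

```python
import gym

env = gym.make("MountainCar-v0")
memory = []  # stand-in for the replay buffer above

state = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random policy, just for illustration
    next_state, reward, done, info = env.step(action)
    # reward is -1.0 on every step, including the final one,
    # whether the flag was reached or the time limit expired.
    memory.append((state, action, reward, next_state, done))
    state = next_state

# The cause of termination is not visible in the stored tuple itself.
print("last reward:", memory[-1][2])                        # always -1.0
print("timed out?", info.get("TimeLimit.truncated", False))
```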

So, I don't see anything in the memory that indicates that the episode was performed well.

And thus, I have no idea why this DQN code is working for MountainCar.

Answer

The reason this works is that in Q-learning, your model is trying to estimate the SUM (technically the time-decayed sum) of all future rewards for each possible action. In MountainCar you get a reward of -1 every step until you win, so if you do manage to win, you end up with a less negative return than usual. For example, your total score after winning might be -160 instead of -200, so your model will start predicting higher Q-values for the actions that have historically led to winning the game.
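In code, this shows up in how the training target is built from each sampled transition. A minimal sketch of the standard DQN target (variable names and the 0.99 discount are illustrative): every transition contributes a reward of -1, but transitions whose next state is terminal stop the bootstrapping, so states from which the flag is reached in few steps end up with less negative Q-values, and those values propagate backwards through the bootstrap term.

```python
import numpy as np

gamma = 0.99  # discount factor (illustrative value)

def dqn_targets(rewards, next_q_values, dones):
    """Compute y = r + gamma * max_a' Q(s', a') for a sampled batch.

    rewards       : shape (batch,)            -- here always -1.0
    next_q_values : shape (batch, n_actions)  -- Q-network output for next_state
    dones         : shape (batch,)            -- 1.0 if next_state ended the episode
    """
    bootstrap = np.max(next_q_values, axis=1)           # estimated value of the best next action
    return rewards + gamma * (1.0 - dones) * bootstrap  # no bootstrapping past a terminal state

# A Q-value therefore approximates -(discounted number of steps left until termination):
# a state-action pair two steps from the flag converges toward -1 + gamma * (-1) ≈ -1.99,
# while one far from the flag accumulates many more discounted -1 terms.
```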
