Reward function for learning to play Curve Fever game with DQN

Views: 83 Published: 2021/4/29 20:50:55 Tags: machine-learning tensorflow deep-learning reinforcement-learning q-learning


Question

I've made a simple version of Curve Fever, also known as "Achtung Die Kurve". I want the machine to figure out how to play the game optimally. I copied and slightly modified an existing DQN from some Atari game examples built with Google's TensorFlow.

I'm trying to figure out an appropriate reward function. Currently, I use this reward setup:

0.1 for every frame without a crash
-500 for every crash

Is this the right approach? Do I need to tweak the values? Or do I need a completely different approach?

Recommended answer

A reward of -500 can destroy your network, because a value that large produces very large errors and gradient updates during training. Scale the rewards to values between -1 and 1. (Also scale the input image to between -1 and 1, or between 0 and 1.)
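As a rough illustration of the scaling advice, here is a minimal sketch in plain Python. The helper names (`scale_reward`, `clip_reward`, `scale_pixels`) and the assumption of 8-bit pixel values are hypothetical and not taken from the question's code. Dividing by the largest absolute reward keeps the ratio between rewards, while clipping (a different technique, named here only as an alternative) simply clamps values into range.

```python
def scale_reward(raw_reward, max_abs_reward=500.0):
    """Divide by the largest absolute reward so results lie in [-1, 1]."""
    return raw_reward / max_abs_reward


def clip_reward(raw_reward, low=-1.0, high=1.0):
    """Alternative: clamp the raw reward into [low, high]."""
    return max(low, min(high, raw_reward))


def scale_pixels(frame):
    """Map assumed 8-bit pixel values (0..255) to floats in [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in frame]


# The question's setup: 0.1 per surviving frame, -500 per crash.
for raw in (0.1, -500.0):
    print(raw, "scaled:", scale_reward(raw), "clipped:", clip_reward(raw))

# A tiny 2x2 stand-in for a game frame.
print(scale_pixels([[0, 255], [51, 102]]))
```

Note that the two options behave differently: scaling shrinks the per-frame reward to a very small number, whereas clipping leaves it at 0.1 and only caps the crash penalty at -1.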

Just give your network a reward of -1 for crashing and a reward of +1 whenever an enemy crashes. Without enemies, a reward of -1 for crashing should be enough. A small constant positive living reward can be beneficial in some situations (for example, when the network has to choose between two inevitable crashes, one of which happens sooner than the other), but it also makes the Q-function harder to learn. Try training with and without a constant living reward and see which works best.
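The reward scheme described above could be sketched as follows. The function name `compute_reward` and its parameters are hypothetical; how the game reports crashes is an assumption, and the final clamp is an added safeguard to keep the value in [-1, 1] when a living reward and an enemy crash coincide.

```python
CRASH_REWARD = -1.0        # the agent itself crashed
ENEMY_CRASH_REWARD = 1.0   # an enemy crashed on this frame


def compute_reward(agent_crashed, enemy_crashed=False, living_reward=0.0):
    """Per-frame reward: -1 on own crash, +1 on enemy crash, else living_reward."""
    if agent_crashed:
        return CRASH_REWARD
    reward = living_reward
    if enemy_crashed:
        reward += ENEMY_CRASH_REWARD
    # Keep the combined value inside [-1, 1].
    return max(-1.0, min(1.0, reward))


# Single-player variant: only the crash matters.
print(compute_reward(agent_crashed=False))
print(compute_reward(agent_crashed=True))

# Variant with enemies and an optional small living reward to experiment with.
print(compute_reward(agent_crashed=False, enemy_crashed=True, living_reward=0.01))
```

Setting `living_reward=0.0` gives the plain -1/+1 scheme, so the same function can be used to compare training runs with and without the constant reward.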

The inevitable-crash example also shows why you should not use a small negative living reward. In that case the network would choose the path to the fastest crash, whereas delaying the crash as long as possible would be the better strategy.
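To make the inevitable-crash argument concrete, the sketch below compares the discounted return of a path that crashes after 3 frames with one that crashes after 10. All numbers are assumptions chosen for illustration: a discount factor of 0.99, a crash reward of -1, and living rewards of -0.05, 0 and +0.05. The function name `discounted_return` is hypothetical.

```python
def discounted_return(frames_survived, living_reward, crash_reward=-1.0, gamma=0.99):
    """Discounted return of a path that survives some frames, then crashes."""
    total = 0.0
    for t in range(frames_survived):
        total += (gamma ** t) * living_reward
    # The crash reward arrives on the frame after the last surviving one.
    total += (gamma ** frames_survived) * crash_reward
    return total


for living_reward in (-0.05, 0.0, 0.05):
    fast = discounted_return(3, living_reward)
    slow = discounted_return(10, living_reward)
    preferred = "delay the crash" if slow > fast else "crash quickly"
    print(f"living reward {living_reward:+.2f}: "
          f"fast={fast:.3f}, slow={slow:.3f} -> {preferred}")
```

Under these assumed values, the negative living reward makes the quick crash the higher-return option, while the zero and positive living rewards favour delaying it. With a discount factor below 1, a negative living reward only flips the preference once its magnitude exceeds (1 - gamma) times the crash penalty, so the effect depends on the exact values used.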
