Finding the right parameters for neural network for pong-game

Problem description

I am having some trouble with my implementation of a deep neural network for the game Pong because my network always diverges, regardless of which parameters I change. I took a Pong game and implemented a Theano/Lasagne-based deep Q-learning algorithm based on the famous Nature paper by Google's DeepMind.

What I want:
Instead of feeding the network with pixel data, I want to input the x- and y-position of the ball and the y-position of the paddle for 4 consecutive frames, so I get a total of 12 inputs.
I only want to reward the hit, the loss, and the win of a round.
With this configuration, the network did not converge and my agent was not able to play the game. Instead, the paddle drove directly to the top or bottom, or repeated the same pattern. So I thought I would try to make it a bit easier for the agent and add some information.

What I did:
States:

  • x-position of the ball (-1 to 1)
  • y-position of the ball (-1 to 1)
  • normalized x-velocity of the ball
  • normalized y-velocity of the ball
  • y-position of the paddle (-1 to 1)

With 4 consecutive frames I get a total input of 20.
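
To make the input format concrete, here is a minimal sketch of how such a 20-dimensional state vector could be assembled (the function names and the velocity normalization constant are illustrative assumptions, not taken from the actual implementation):

```python
import numpy as np
from collections import deque

MAX_BALL_SPEED = 10.0  # assumed normalization constant for the velocities

def frame_features(ball_x, ball_y, ball_vx, ball_vy, paddle_y):
    """5 features per frame; the positions are already in [-1, 1]."""
    return [ball_x, ball_y,
            ball_vx / MAX_BALL_SPEED,
            ball_vy / MAX_BALL_SPEED,
            paddle_y]

history = deque(maxlen=4)  # the last 4 frames

def build_state(ball_x, ball_y, ball_vx, ball_vy, paddle_y):
    history.append(frame_features(ball_x, ball_y, ball_vx, ball_vy, paddle_y))
    while len(history) < 4:            # pad at the start of an episode
        history.append(history[-1])
    return np.concatenate(list(history)).astype(np.float32)  # shape: (20,)
```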

Rewards:

  • +10 if the paddle hits the ball
  • +100 if the agent wins the round
  • -100 if the agent loses the round
  • -5 to 0 depending on the distance between the predicted final y-position of the ball and the current y-position of the paddle
  • +20 if the predicted final position of the ball is within the current range of the paddle (a foreseeable hit)
  • -5 if the ball is behind the paddle (no hit possible anymore)

With this configuration, the network still diverges. I tried playing around with the learning rate (0.1 to 0.00001), the number of nodes in the hidden layers (5 to 500), the number of hidden layers (1 to 4), the batch accumulator (sum or mean), and the update rule (rmsprop or DeepMind's rmsprop).
None of these led to a satisfactory solution. The graph of the loss averages mostly looks something like this. You can download my current version of the implementation here.
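
For reference, a minimal sketch of the kind of Lasagne/Theano Q-network and rmsprop update described above (the layer sizes, learning rate, and rmsprop settings are placeholder values, not my exact configuration):

```python
import theano
import theano.tensor as T
import lasagne

# Placeholder architecture: 20 state inputs, one hidden layer, 3 actions
# (up, down, stay); Q-values come from a linear output layer.
l_in = lasagne.layers.InputLayer(shape=(None, 20))
l_hid = lasagne.layers.DenseLayer(
    l_in, num_units=100, nonlinearity=lasagne.nonlinearities.rectify)
l_out = lasagne.layers.DenseLayer(
    l_hid, num_units=3, nonlinearity=lasagne.nonlinearities.linear)

states = T.matrix('states')    # batch of 20-dimensional states
targets = T.matrix('targets')  # Q-learning targets, one value per action

q_values = lasagne.layers.get_output(l_out, states)
loss = T.mean(0.5 * (q_values - targets) ** 2)   # "mean" batch accumulator

params = lasagne.layers.get_all_params(l_out, trainable=True)
updates = lasagne.updates.rmsprop(loss, params, learning_rate=0.00025,
                                  rho=0.95, epsilon=1e-6)
train_fn = theano.function([states, targets], loss, updates=updates)
```
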
I would be very grateful for any hint :)
Koanashi

Answer

Repeating my suggestion from the comments as an answer now, to make it easier to see for anyone else ending up on this page later (it was posted as a comment first since I was not 100% sure it would be the solution):

Reducing the magnitude of the rewards to lie in (or at least close to) the [0.0, 1.0] or [-1.0, 1.0] intervals helps the network to converge more quickly.

Changing the reward values in such a way (simply dividing them all by a number to make them lie in a smaller interval) does not change what a network is able to learn in theory. The network could also simply learn the same concepts with larger rewards by finding larger weights throughout the network.
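
As a concrete illustration of that rescaling (the divisor of 100 below is just an example, chosen so the reward values from the question land roughly in [-1, 1]; the DeepMind Nature paper instead clips all rewards to [-1, 1]):

```python
import numpy as np

REWARD_SCALE = 100.0  # example divisor, chosen so +/-100 maps to +/-1

def scale_reward(reward):
    # Same relative ordering of the rewards, just smaller magnitudes.
    return reward / REWARD_SCALE

def clip_reward(reward):
    # Alternative used in the DeepMind Nature paper: clip to [-1, 1].
    return float(np.clip(reward, -1.0, 1.0))
```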

However, learning such large weights typically takes much more time. The main reason for this is that weights are often initialized to random values close to 0, so it takes a lot of time to change those values to large values through training. Because the weights are (typically) initialized to small values that are very far away from the optimal weight values, this also means there is an increased risk of a local (not global) minimum along the way to the optimal weight values, which the network can get stuck in.

With lower reward values, the optimal weight values are likely to be low in magnitude as well. This means that weights initialized to small random values are already more likely to be close to their optimal values. This leads to a shorter training time (less "distance" to travel to put it informally), and a decreased risk of there being local minima along the way to get stuck in.
