The huge amount of states in q-learning calculation


Problem Description

I implemented a 3x3 OX game with Q-learning (it works perfectly both AI vs. AI and AI vs. Human), but I can't go one step further to a 4x4 OX game, since it eats up all my PC memory and crashes.

Here is my current problem: Access violation in huge array?

In my understanding, a 3x3 OX game has 3 possible values per cell (space, white, black), so 3^9 = 19,683 possible states. (The same pattern at a different angle still counts as a distinct state.)

For a 4x4 OX game, the total number of states will be 3^16 = 43,046,721.

For a regular Go game on a 15x15 board, the total number of states will be 3^225 ≈ 2.5 x 10^107.

Q1. I want to know whether my calculation is correct. (For a 4x4 OX game, do I need a 3^16-entry array?)

Q2. Since I need to calculate a Q-value for each (state, action) pair, I need an extremely large array. Is that expected? Is there any way to avoid it?

Recommended Answer

If you skip reinventing the wheel, here is what has been done to solve this problem:

The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm.

https://arxiv.org/pdf/1312.5602v1.pdf

We could represent our Q-function with a neural network, that takes the state (four game screens) and action as input and outputs the corresponding Q-value. Alternatively we could take only game screens as input and output the Q-value for each possible action. This approach has the advantage, that if we want to perform a Q-value update or pick the action with highest Q-value, we only have to do one forward pass through the network and have all Q-values for all actions immediately available.

https://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
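
As a concrete illustration of that idea, here is a minimal sketch (my own, assuming PyTorch; not code from the linked paper or blog post) of a network that maps a 4x4 OX board to one Q-value per cell, so no 3^16 table is ever allocated:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoardQNet(nn.Module):
    """Maps a 4x4 board (16 cells: 0=empty, 1=own, 2=opponent) to 16 Q-values."""
    def __init__(self, cells: int = 16, marks: int = 3, hidden: int = 128):
        super().__init__()
        self.cells, self.marks = cells, marks
        self.net = nn.Sequential(
            nn.Linear(cells * marks, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, cells),  # one Q-value per cell (action)
        )

    def forward(self, board: torch.Tensor) -> torch.Tensor:
        # board: (batch, 16) long tensor; one-hot each cell, flatten to (batch, 48)
        x = F.one_hot(board, num_classes=self.marks).float()
        return self.net(x.view(-1, self.cells * self.marks))

# One forward pass yields the Q-values of all 16 actions at once.
q_net = BoardQNet()
empty_board = torch.zeros(1, 16, dtype=torch.long)
q_values = q_net(empty_board)         # shape (1, 16)
best_action = q_values.argmax(dim=1)  # greedy move (mask illegal moves in practice)
```

The training loop (experience replay, target network, and so on) is omitted; the point is only that the memory cost now scales with the number of network weights rather than with 3^16.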
