The huge amount of states in Q-learning calculation
Problem Description
I implemented a 3x3 OX game with Q-learning (it works perfectly in AI vs. AI and AI vs. Human play), but I can't go one step further to a 4x4 OX game, since it eats up all my PC's memory and crashes.
Here is my current problem: Access violation in huge array?
In my understanding, a 3x3 OX game has a total of 3 (empty, white, black) ^ 9 = 19,683 possible states (the same pattern at a different angle still counts as a separate state).
For a 4x4 OX game, the total number of states will be 3^16 = 43,046,721.
For a regular Go game on a 15x15 board, the total number of states will be 3^225 ≈ 2.5 × 10^107.
Q1. I want to know whether my calculation is correct. (For a 4x4 OX game, do I need a 3^16-entry array?)
Q2. Since I need to calculate a Q-value for each (state, action) pair, I need an array this large. Is that expected? Is there any way to avoid it?
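One common way to sidestep allocating the full dense table (an assumption on my part, not something the answer below requires) is to store Q-values lazily in a hash map keyed by (state, action): in practice only a tiny fraction of the 3^16 states is ever visited, so only those entries consume memory. A minimal sketch:

```python
from collections import defaultdict

# Sparse, lazily-populated Q-table: unseen (state, action) pairs default
# to 0.0 and cost no memory until they are touched.
Q = defaultdict(float)

def td_update(state, action, reward, next_state, next_actions,
              alpha=0.1, gamma=0.9):
    # Standard one-step Q-learning update, operating on the sparse table.
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * best_next
                                   - Q[(state, action)])

# A state can be any hashable encoding of the board, e.g. a tuple of cells.
board = (0,) * 16                 # empty 4x4 board
td_update(board, 5, 1.0, board, range(16))
```

This keeps memory proportional to the number of visited states rather than the full state space, though for genuinely huge spaces (like Go) function approximation, as in the answer below, is the usual escape.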
Recommended Answer
If you would rather not reinvent the wheel, here is what has been done to solve this problem:
The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm.
https://arxiv.org/pdf/1312.5602v1.pdf
We could represent our Q-function with a neural network that takes the state (four game screens) and action as input and outputs the corresponding Q-value. Alternatively, we could take only game screens as input and output the Q-value for each possible action. This approach has the advantage that, if we want to perform a Q-value update or pick the action with the highest Q-value, we only have to do one forward pass through the network and have all Q-values for all actions immediately available.
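The second architecture described above can be sketched with a linear model standing in for the paper's convolutional network (the names and sizes here are illustrative assumptions, not from the paper):

```python
import numpy as np

# State in, one Q-value per action out, in a single forward pass.
# A linear approximator stands in for a real neural network.
n_features, n_actions = 16, 16    # e.g. a flattened 4x4 board, one move per cell
W = np.zeros((n_actions, n_features))

def q_values(state):
    # One forward pass yields Q-values for ALL actions at once.
    return W @ state

def td_update(state, action, reward, next_state, done, alpha=0.01, gamma=0.9):
    # Semi-gradient Q-learning step on the approximator's weights.
    target = reward + (0.0 if done else gamma * q_values(next_state).max())
    td_error = target - q_values(state)[action]
    W[action] += alpha * td_error * state

state = np.ones(n_features)
td_update(state, action=3, reward=1.0, next_state=state, done=True)
```

The point of the sketch is the memory trade-off: instead of one table entry per (state, action) pair, the approximator stores a fixed number of weights and generalizes across states.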
https://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/