Q-values exploding when training DQN


Problem description

I'm training a DQN to play OpenAI's Atari environment, but the Q-values of my network quickly explode far above what is realistic.

Here is the relevant part of the code:

for state, action, reward, next_state, done in minibatch:
        if not done:
            # To save on memory, next_state is just one frame
            # So we have to add it to the current state to get the actual input for the network
            next_4_states = np.array(state)
            next_4_states = np.roll(next_4_states, 1, axis=3)
            next_4_states[:, :, :, 0] = next_state
            target = reward + self.gamma * \
                np.amax(self.target_model.predict(next_4_states))
        else:
            target = reward
        target_f = self.target_model.predict(state)
        target_f[0][action] = target

        self.target_model.fit(state, target_f, epochs=1, verbose=0)

The discount factor is 0.99 (this doesn't happen with a discount factor of 0.9, but then it also doesn't converge because it can't look far enough ahead).

Stepping through the code, the reason it's happening is that all the Q-values that aren't meant to be updated (the ones for actions we didn't take) increase slightly. My understanding is that passing the network's own output back to it as the target during training should keep that output the same, not increase or decrease it. Is there something wrong with my model? Is there some way I can mask the update so it only updates the relevant Q-value?

My model creation code is here:

# Assumed imports for this snippet (standalone Keras, as implied by Adam(lr=...))
from keras.models import Sequential
from keras.layers import Convolution2D, Flatten, Dense
from keras.optimizers import Adam

def create_model(self, input_shape, num_actions, learning_rate):
        model = Sequential()
        model.add(Convolution2D(32, 8, strides=(4, 4),
                                activation='relu', input_shape=input_shape))
        model.add(Convolution2D(64, 4, strides=(2, 2), activation='relu'))
        model.add(Convolution2D(64, 3, strides=(1, 1), activation='relu'))
        model.add(Flatten())
        model.add(Dense(512, activation='relu'))
        model.add(Dense(num_actions))

        model.compile(loss='mse', optimizer=Adam(lr=learning_rate))

        return model

I create two of these: one for the online network and one for the target network.

Recommended answer

Which predictions get updated?

"Stepping through the code, the reason it's happening is that all the Q-values that aren't meant to be updated (the ones for actions we didn't take) increase slightly. My understanding is that passing the network's own output back to it as the target during training should keep that output the same, not increase or decrease it."

Below I have drawn a very simple neural network with 3 input nodes, 3 hidden nodes, and 3 output nodes. Suppose that you have only set a new target for the first action, and simply reuse the existing predictions as targets for the other actions. This results in a non-zero error (for simplicity I'll just assume it is greater than zero; denoted by delta in the image) only for the first action/output, and errors of 0 for the others.

I have drawn the connections through which this error will be propagated from the output layer to the hidden layer in bold. Note how each of the nodes in the hidden layer still gets an error. When these nodes then propagate their errors back to the input layer, they do this through all of the connections between the input and hidden layers, so all of those weights can be modified.

So, imagine all those weights got updated, and now imagine doing a new forward pass with the original inputs. Do you expect output nodes 2 and 3 to have exactly the same outputs as before? Probably not: the connections from the hidden nodes to the last two outputs may still have the same weights, but all three hidden nodes will have different activation levels. So no, the other outputs are not guaranteed to remain the same.
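To make this concrete, here is a minimal NumPy sketch (my own illustration, not part of the original answer; all variable names are made up): a tiny 3-3-3 network takes one gradient step with a non-zero error on output 0 only, and outputs 1 and 2 still change because the input-to-hidden weights are shared.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))   # input -> hidden weights (shared by all outputs)
W2 = rng.normal(size=(3, 3))   # hidden -> output weights

def forward(x):
    h = np.tanh(x @ W1)        # hidden activations
    return h, h @ W2           # linear outputs (think: Q-values for 3 actions)

x = np.array([0.5, -0.2, 0.1])
h, y_before = forward(x)

target = y_before.copy()
target[0] += 1.0               # non-zero error ("delta") on output 0 only

err = y_before - target                     # zero everywhere except index 0
grad_W2 = np.outer(h, err)                  # only column 0 of W2 gets a gradient
grad_h = W2 @ err                           # ...but every hidden node gets an error
grad_W1 = np.outer(x, grad_h * (1 - h**2))  # so all input->hidden weights move
lr = 0.1
W2 -= lr * grad_W2
W1 -= lr * grad_W1

_, y_after = forward(x)
print(y_after - y_before)      # entries 1 and 2 are generally not zero

Running it prints small but non-zero changes for the last two outputs, which is exactly the drift described above.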

"Is there some way I can mask the update so it only updates the relevant Q-value?"

Not easily, if at all. The problem is that the connections between pairs of layers, other than those between the final pair, are not action-specific, and I don't think you want them to be either.
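For completeness, here is a hedged sketch (my addition, assuming a tf.keras model and TensorFlow 2; the function and variable names are made up) of how many implementations restrict the loss to the taken action with a one-hot mask and a custom training step. Gradient-wise this is equivalent to filling the other actions' targets with the current predictions, which the question's loop already does, so it still does not prevent the shared lower-layer weights, and hence the other outputs, from moving.

import tensorflow as tf

@tf.function
def train_masked(model, optimizer, states, actions, td_targets, num_actions):
    """One gradient step where only the taken action's Q-value enters the loss."""
    with tf.GradientTape() as tape:
        q_values = model(states, training=True)                   # (batch, num_actions)
        action_mask = tf.one_hot(actions, num_actions)             # one-hot per transition
        chosen_q = tf.reduce_sum(q_values * action_mask, axis=1)   # Q(s, a) for the taken a
        td_targets = tf.cast(td_targets, chosen_q.dtype)
        loss = tf.reduce_mean(tf.square(td_targets - chosen_q))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss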

Is there something wrong with my model?

One thing I'm seeing is that you seem to be updating the same network that is used to generate targets:

target_f = self.target_model.predict(state)

self.target_model.fit(state, target_f, epochs=1, verbose=0)

Both use self.target_model. You should use separate copies of the network for those two lines, and only copy the updated network's weights into the network used to compute the targets after longer periods of time. For a bit more on this, see Addition 3 in this post.
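A hedged sketch of that fix, reusing create_model and the variable names from the question (the class name, the train_step signature, and the 10000-step update frequency are illustrative assumptions, not the original code):

import numpy as np

class DQNAgent:
    # create_model is the method shown in the question (omitted here)

    def __init__(self, input_shape, num_actions, learning_rate,
                 gamma=0.99, target_update_freq=10000):
        self.gamma = gamma
        self.target_update_freq = target_update_freq
        self.train_steps = 0
        self.model = self.create_model(input_shape, num_actions, learning_rate)         # online
        self.target_model = self.create_model(input_shape, num_actions, learning_rate)  # target
        self.target_model.set_weights(self.model.get_weights())

    def train_step(self, state, next_4_states, action, reward, done):
        # Targets come from the frozen target network...
        target = reward
        if not done:
            target += self.gamma * np.amax(self.target_model.predict(next_4_states))
        # ...but we predict with and fit only the online network
        target_f = self.model.predict(state)
        target_f[0][action] = target
        self.model.fit(state, target_f, epochs=1, verbose=0)

        # Every target_update_freq training steps, sync the target network
        self.train_steps += 1
        if self.train_steps % self.target_update_freq == 0:
            self.target_model.set_weights(self.model.get_weights())

The target network only changes when set_weights is called, so the targets stay fixed between syncs instead of chasing the very network that is being trained.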

Apart from that, it is well known that DQN can still have a tendency to overestimate Q-values (though it generally shouldn't completely explode). This can be addressed by using Double DQN (note: this is an improvement that was added later on top of DQN).
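As a hedged sketch (the function name and signature are my own, not from the answer), the Double DQN target for a single transition lets the online network choose the next action while the target network evaluates it, which reduces the overestimation that comes from taking a max over noisy estimates:

import numpy as np

def double_dqn_target(online_model, target_model, reward, next_4_states, done, gamma):
    """Double DQN target for a single transition."""
    if done:
        return reward
    online_q_next = online_model.predict(next_4_states)[0]   # the online net *selects*...
    best_action = int(np.argmax(online_q_next))
    target_q_next = target_model.predict(next_4_states)[0]   # ...the target net *evaluates*
    return reward + gamma * target_q_next[best_action]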
