Something wrong with Keras code Q-learning OpenAI gym FrozenLake


Problem description

Maybe my question will seem stupid.

I'm studying the Q-learning algorithm. In order to better understand it, I'm trying to remake the TensorFlow code of this FrozenLake example into Keras code.

My code:

import gym
import numpy as np
import random

from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K    

import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('FrozenLake-v0')

model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(16,)))
model.add(Dense(4, activation='softmax', kernel_initializer='uniform'))

def custom_loss(yTrue, yPred):
    return K.sum(K.square(yTrue - yPred))

model.compile(loss=custom_loss, optimizer='sgd')

# Set learning parameters
y = .99
e = 0.1
#create lists to contain total rewards and steps per episode
jList = []
rList = []

num_episodes = 2000
for i in range(num_episodes):
    current_state = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j+=1

        current_state_Q_values = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.reshape(np.argmax(current_state_Q_values), (1,))

        if np.random.rand(1) < e:
            action[0] = env.action_space.sample() #random action

        new_state, reward, d, _ = env.step(action[0])

        rAll += reward
        jList.append(j)
        rList.append(rAll)

        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        max_newQ = np.max(new_Qs)

        targetQ = current_state_Q_values
        targetQ[0,action[0]] = reward + y*max_newQ
        model.fit(np.identity(16)[current_state:current_state+1], targetQ, verbose=0, batch_size=1)
        current_state = new_state

        if d == True:
            #Reduce chance of random action as we train the model.
            e = 1./((i/50) + 10)
            break
print("Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%")

When I run it, it doesn't work well: Percent of succesful episodes: 0.052%

plt.plot(rList)

The original TensorFlow code does much better: Percent of succesful episodes: 0.352%

plt.plot(rList)

What am I doing wrong?

Answer

Besides setting use_bias=False as @Maldus mentioned in the comments, another thing you can try is to start with a higher epsilon value (e.g. 0.5 or 0.75). A trick might be to decrease the epsilon value only if you reach the goal, i.e. don't decrease epsilon at the end of every episode. That way your player can keep exploring the map randomly until it starts to converge on a good route, and only then does it become a good idea to reduce the epsilon parameter.
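As a rough illustration of that trick, here is a minimal sketch of how the training loop from the question could be modified. The starting epsilon of 0.5 and the 0.9 decay factor are assumptions for illustration, not values given in this answer; it reuses the env, model and imports defined above.

# Sketch only: start exploration higher and decay epsilon only when the goal is reached.
# The initial value 0.5 and the 0.9 decay factor are illustrative assumptions.
e = 0.5
y = .99
rList = []

num_episodes = 2000
for i in range(num_episodes):
    current_state = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j += 1
        current_state_Q_values = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.argmax(current_state_Q_values)
        if np.random.rand(1) < e:
            action = env.action_space.sample()  # random exploratory action

        new_state, reward, d, _ = env.step(action)
        rAll += reward

        # Same Q-target update as in the question
        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        targetQ = current_state_Q_values
        targetQ[0, action] = reward + y * np.max(new_Qs)
        model.fit(np.identity(16)[current_state:current_state+1], targetQ, verbose=0, batch_size=1)
        current_state = new_state

        if d == True:
            if reward > 0:
                # Goal reached: only now shrink the chance of a random action
                e = max(0.01, e * 0.9)
            break
    rList.append(rAll)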

I've actually implemented a similar model in Keras in this gist, using Convolutional layers instead of Dense layers. Managed to get it to work in under 2000 episodes. Might be of some help to others :)
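The gist itself isn't reproduced here, but a minimal sketch of what such a convolutional variant might look like, assuming the 16-dimensional one-hot state is reshaped into a 4x4 single-channel grid (the layer sizes and activations are guesses, not taken from the gist):

# Hypothetical sketch of a convolutional Q-network; not the actual gist code.
from keras.layers import Conv2D, Flatten, Dense
from keras.models import Sequential

conv_model = Sequential()
# Treat the 16-dim one-hot state as a 4x4 single-channel "image" of the lake
conv_model.add(Conv2D(16, (2, 2), activation='relu', input_shape=(4, 4, 1)))
conv_model.add(Flatten())
conv_model.add(Dense(4, activation='linear', use_bias=False))
conv_model.compile(loss='mse', optimizer='sgd')

# A state s would then be fed as np.identity(16)[s].reshape(1, 4, 4, 1)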
