OpenAI gym Breakout-ram-v4 unable to learn


Problem description

I am using Q-learning and the program should be able to play the game after some tries, but it is not learning even when the epsilon value is 0.1.

I have tried changing the batch size and the memory size. I have changed the code to give a -1 reward if the player dies.

import gym
import numpy as np
import random
import tensorflow as tf
import time
import keyboard
import sys


env = gym.make("Breakout-ram-v4")
observationSpace = env.observation_space
actionSpace=  env.action_space
episode = 500

class Model_QNN :
    def __init__(self):
        self.memory = []
        self.MAX_MEMORY_TO_USE = 60_000
        self.gamma = 0.9
        self.model = tf.keras.Sequential([
                tf.keras.layers.Flatten(input_shape=(128,1)),
                tf.keras.layers.Dense(256,activation="relu"),
                tf.keras.layers.Dense(64,activation="relu"),
                tf.keras.layers.Dense(actionSpace.n , activation=  "softmax")
            ])
        self.model.compile(optimizer="adam",loss="mse",metrics=["accuracy"])

    def remember(self, steps , done):
        self.memory.append([steps,done])
        if(len(self.memory) >= self.MAX_MEMORY_TO_USE):
            del self.memory[0]
    def replay(self,batch_size= 32):
        states, targets_f = [], []
        if(len(self.memory)< batch_size) :
            return 
        else: 
            mini = random.sample(self.memory,batch_size)
            states ,targets  = [],  [] 
            for steps , done  in mini :
                target= steps[2] ;
                if not done :
                    target = steps[2]  + (self.gamma* np.amax(self.model.predict(steps[3].reshape(1,128,1))[0]))
                target_f = self.model.predict(steps[0].reshape(1,128,1))
                target_f[0][steps[1]] = target
                states.append(steps[0])
                targets.append(target_f[0])
            self.model.fit(np.array(states).reshape(len(states),128,1), np.array(targets),verbose=0,epochs=10)
    def act(self,state,ep):
        if(random.random()< ep):
            action = actionSpace.sample()
        else :
            np.array([state]).shape
            action= self.model.predict(state.reshape(1,128,1))
            action = np.argmax(action)
        return  action;
    def saveModel (self):
        print("Saving")
        self.model.save("NEWNAMEDONE")
    def saveBackup(self,num):
        self.model.save("NEWNAME"+str(int(num)))
def main():
    agent= Model_QNN();
    epsilon=0.9
    t_end = time.time()
    score=  0
    for e in range(2000):
        print("Working on episode : "+str(e)+" eps "+str(epsilon)+" Score  " + str(score))
        preState = env.reset()
        preState,reward,done,_ = env.step(1)
        mainLife=5
        done = False
        score=  0
        icount = 0
        render=False
        if e % 400 ==0 and not e==0:
            render =True
        while not done:
            icount+=1
            if render:
                env.render()
            if keyboard.is_pressed('q'):
                agent.saveBackup(100)
                agent.saveModel()
                quit()
            rewrd=0
            if ( _["ale.lives"] < mainLife ):
                mainLife-=1
                rewrd=-1
                action=1
            else: 
                action = agent.act(preState,epsilon)
            newState,reward,done,_ = env.step(action)
            if rewrd ==-1 :
                reward =-1
            agent.remember([preState/255,action,reward,newState/255],done);
            preState= newState;
            score+=reward 
            if done :
                break
        agent.replay(1024)
        if epsilon >= 0.18 :
           epsilon = epsilon * 0.995;
        if ((e+1)%500==0):
            agent.saveBackup((e+1)/20)
    agent.saveModel()


if __name__=='__main__':
    main()

There is no error message; the program should learn, but it is not.

Recommended answer

Why are you using Softmax on your output layer? If you want to use Softmax, use cross-entropy as your loss. However, it looks like you're trying to implement a value-based learning system, so the activation function on your output layer should be linear.
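
A minimal sketch of that change, mirroring the model from the question but with a linear output head (the Dense layer's default activation) so the network emits raw Q-value estimates trained against an MSE loss:

# Sketch only: the asker's architecture with the softmax removed.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(128, 1)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(actionSpace.n)  # linear output: one unnormalised Q-value per action
])
model.compile(optimizer="adam", loss="mse")  # MSE against the bootstrapped Q targets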

I suggest you try your implementation on CartPole-v0 and then LunarLander-v2 first. Those are solved environments and a great place to sanity-check your code.
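
As an illustration (not code from the original post), a sanity-check harness might look like the sketch below. It assumes the agent's network and the reshape inside act() are adapted to CartPole's 4-dimensional observation and 2 actions; run_episode is a hypothetical helper, not part of the asker's class.

# Hypothetical sanity-check harness: drive the same agent interface on CartPole-v0,
# whose small state and action spaces make silent failures much easier to spot.
import gym

env = gym.make("CartPole-v0")

def run_episode(agent, epsilon):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state, epsilon)               # same act()/remember()/replay() interface
        next_state, reward, done, _ = env.step(action)
        agent.remember([state, action, reward, next_state], done)
        state = next_state
        total_reward += reward
    agent.replay(32)                                     # a small batch is enough for CartPole
    return total_reward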

"There is no error message; the program should learn, but it is not." Welcome to ML, where things fail silently.
