强化学习对这款非常简单的游戏不起作用，为什么?问学习 [英] Reinforcement Learning doesn't work for this VERY EASY game, why? Q Learning

查看：42 发布时间：2021/7/7 18:57:03 python tensorflow reinforcement-learning

本文介绍了强化学习对这款非常简单的游戏不起作用，为什么?问学习的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我编写了一个非常简单的游戏，其工作方式如下:

I programmed a very easy game which works the following way:

给定一个 4x4 的方格区域，玩家可以移动(向上、向右、向下或向左).

Given an 4x4 field of squares, a player can move (up, right, down or left).

进入一个智能体从未去过的正方形，奖励为 1.

Going on a square the agent never visited before gives the reward 1.

踩到死地"会得到 -5 的奖励，然后游戏将被重置.

Stepping on "dead-field" is rewarded with -5 and then the game will be resetted.

在已经访问过的字段上移动会得到 -1

Moving on a field that was already visited is rewarded with -1

进入胜利场"(只有一个)会得到奖励 5，游戏也将重置.

Going on the "win-field" (there's exactly one) gives the reward 5 and the game will be resetted as well.

现在我想让 AI 通过 Q-Learning 来学习玩那个游戏.

Now I want an AI to learn to play that game via Q-Learning.

我如何组织输入/特征工程:

How I organized the Inputs / feature engineering:

网络的输入是一个形状为 1x4 的数组，其中 arr[0] 代表上方的字段(向上移动时)，arr[1] 代表向右的字段，arr[2] 代表下方的字段，arr[3] 左边那个.

An input for the net is an array with the shape 1x4 where arr[0] represents the field above (when moving up), arr[1] represents the field to the right, arr[2] the one below, arr[3] the one to the left.

数组可以容纳的可能值:0, 1, 2, 3

Possible values the array can hold: 0, 1, 2, 3

0 = "dead field", 所以是最坏的情况

0 = "dead field", so the worst case

1 = 这将在 4x4 字段之外(因此您无法进入)或该字段已被访问

1 = this would be outside of the 4x4 field (so you can't step there) or the field was already visited

2 = 未访问的字段(所以这是好事)

2 = unvisited field (so that is something good)

3 = "win field", 所以最好的情况

3 = "win field", so the best-case

如您所见，我是根据他们的奖励订购的.

As you see, I ordered them by their reward.

由于游戏以相同的方式接受输入(0 = 向上移动，1 = 向右移动，2 = 向下移动，3 = 向左移动)，AI 唯一需要学习的基本上是:选择保存最大值的数组索引.

Since the game takes an input the same way (0 = move up, 1 = move to the right, 2 = move down, 3 = move to the left), the only thing the AI would have to learn is basically: Choose the array index that holds the greatest value.

但不幸的是它不起作用，网络就是不学习，即使在 30.000 集之后也不行.

But unforntunately it doesn't work, the net just doesn't learn, not even after 30.000 episodes.

这是我的代码(包括开头的游戏):

Here's my code (including the game at the beginning):

import numpy as np
import random
Import tensorflow as tf
import matplotlib.pyplot as plt

from time import sleep

episoden = 0

felder = []
schon_besucht = []

playerx = 0
playery = 0

grafik = False

def gib_zustand():
    # besonderes feature engineering:
    # input besteht nur aus einer richtung, die one-hot-encoded ist; also 4 inputneuronen
    # (glut, wand/besucht, unbesucht, sieg)
    #
    # es ist die richtung, die bewertet werden soll (also 1 outputneuron fuer eine richtung)

    # rueckgabe hier: array, shape: 4x4 (s.o.)

    global playerx
    global playery

    # oben 
    if playery == 0:
        oben = 1
    else:
        oben = felder[playery-1][playerx]

    # rechts
    if playerx == 4:
        rechts = 1
    else:
        rechts = felder[playery][playerx+1]

    # unten
    if playery == 4:
        unten = 1
    else:
        unten = felder[playery+1][playerx]

    # links
    if playerx == 0:
        links = 1
    else:
        links = felder[playery][playerx-1]

    return np.array([oben, rechts, unten, links])

def grafisch():
    if grafik:

        # encoding:
        # glut = G, besucht = b, unbesucht = , sieg = S, Spieler = X
        global felder
        global playerx
        global playery

        print('')

        for y in range(0,5):
            print('|', end='')
            for x in range(0,5):
                if felder[y][x] == 0:
                    temp = 'G'
                if felder[y][x] == 1:
                    temp = 'b'
                if felder[y][x] == 2:
                    temp = ' '
                if felder[y][x] == 3:
                    temp = 'S'
                if y == playery and x == playerx:
                    temp = 'X'

                print(temp, end='')
                print('|', end='')
            print('')

def reset():
    print('--- RESET ---')

    global playery
    global playerx
    global felder
    global schon_besucht

    playerx = 1
    playery = 3

    # anordnung
    # glut = 0, wand/besucht = 1, unbesucht = 2, sieg = 3

    felder = [[2 for x in range(0,5)] for y in range(0,5)]
    # zwei mal glut setzen
    gl1 = random.randint(1,3)
    gl1_1 = random.randint(2,3) if gl1==3 else (random.randint(1,2) if gl1==1 else random.randint(1,3))
    felder[gl1][gl1_1] = 0 # glut

    # zweites mal
    gl1 = random.randint(1,3)
    gl1_1 = random.randint(2,3) if gl1==3 else (random.randint(1,2) if gl1==1 else random.randint(1,3))
    felder[gl1][gl1_1] = 0 # glut

    # pudding
    felder[1][3] = 3

    # ruecksetzen
    schon_besucht = []

    grafisch()

    return gib_zustand()

def step(zug):
    # 0 = oben, 1 = rechts, 2 = unten, 3 = links
    global playerx
    global playery
    global felder
    global schon_besucht

    if zug == 0:
        if playery != 0:
            playery -= 1
    if zug == 1:
        if playerx != 4:
            playerx += 1
    if zug == 2:
        if playery != 4:
            playery += 1
    if zug == 3:
        if playerx != 0:
            playerx -= 1

    # belohnung holen
    wert = felder[playery][playerx]

    if wert==0:
        belohnung = -5
    if wert==1:
        belohnung = -1
    if wert==2:
        belohnung = 1
    if wert==3:
        belohnung = 5

    # speichern wenn nicht verloren
    if belohnung != -5:
        schon_besucht.append((playery,playerx))
        felder[playery][playerx] = 1

    grafisch()

    return gib_zustand(), belohnung, belohnung==5, 0 # 0 damits passt

episoden = 0

tf.reset_default_graph()

#These lines establish the feed-forward part of the network used to choose actions
inputs1 = tf.placeholder(shape=[1,4],dtype=tf.float32)
#W1 = tf.Variable(tf.random_uniform([16,8],0,0.01))
W2 = tf.Variable(tf.random_uniform([4,4],0,0.01))
#schicht2 = tf.matmul(inputs1,W1)
Qout = tf.matmul(inputs1,W2)
predict = tf.argmax(Qout,1)

#Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)

init = tf.initialize_all_variables()

# Set learning parameters
y = .99
e = 0.1
num_episodes = 10_000
#create lists to contain total rewards and steps per episode
jList = []
rList = []
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):             
        #Reset environment and get first new observation
        s = reset()
        rAll = 0
        d = False
        j = 0
        #The Q-Network        
        while j < 99:
            j+=1
            #Choose an action by greedily (with e chance of random action) from the Q-network
            a,allQ = sess.run([predict,Qout],feed_dict={inputs1:s.reshape(1,4)}) # berechnet prediction fuer input (input scheint hier one hot encoded zu sein)
            if np.random.rand(1) < e:
                a[0] = random.randint(0,3)                 

            #Get new state and reward from environment
            s1,r,d,_ = step(a[0])
            #Obtain the Q' values by feeding the new state through our network
            Q1 = sess.run(Qout,feed_dict={inputs1:s1.reshape(1,4)})
            #Obtain maxQ' and set our target value for chosen action.
            maxQ1 = np.max(Q1)


            targetQ = allQ
            targetQ[0,a[0]] = r + y*maxQ1
            #Train our network using target and predicted Q values

            _,W1 = sess.run([updateModel,W2],feed_dict={inputs1:s.reshape(1,4),nextQ:targetQ})
            rAll += r
            s = s1

            if r == -5 or r == 5:
                if r == 5:
                    episoden+=1

                reset()

                #Reduce chance of random action as we train the model.
                e = 1./((i/50) + 10)
                break
        jList.append(j)
        #print(rAll)
        rList.append(rAll)
print("Percent of succesful episodes: " + str((episoden/num_episodes)*100) + "%")
plt.plot(rList)
plt.plot(jList)

我在一个类似的问题中读到，Q 值过高的原因可能是，代理实际上有可能在游戏中获得无限高的总奖励.这里就是这种情况，如果代理可以踩到已经访问过的字段并获得 1 的奖励.当然，可能的总奖励将是无穷大.但这里的情况并非如此:当玩家这样做时，他会得到不好的奖励 (-1).小计算:胜利场奖励5.未访问场奖励1.至少有一个死区.总共有16个字段.最大可能的总奖励:14*1 + 1*5 = 19

I read in a simular question, that a reason for too high Q-Values can be, that it is in fact possible for the agent to get unlimited high total rewards in a game. That would be the case here, if the agent could step on already visited fields and would get a reward of 1. Then of course, the possible total reward would be infinity. But that isn't the case here: The player gets a bad reward (-1) when he does that. Little calculation: win-field is rewarded with 5. Unvisited field with 1. There's at least one dead-field. All in all there are 16 fields. Max possible total reward: 14*1 + 1*5 = 19

强化学习对这款非常简单的游戏不起作用，为什么?问学习 [英] Reinforcement Learning doesn't work for this VERY EASY game, why? Q Learning

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

强化学习对这款非常简单的游戏不起作用，为什么?问学习 [英] Reinforcement Learning doesn&#39;t work for this VERY EASY game, why? Q Learning

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

强化学习对这款非常简单的游戏不起作用，为什么?问学习 [英] Reinforcement Learning doesn't work for this VERY EASY game, why? Q Learning

登录关闭