Reinforcement Learning doesn't work for this VERY EASY game, why? Q Learning


Problem Description

I programmed a very easy game which works the following way:

Given a 4x4 field of squares, a player can move (up, right, down or left).

  • Going on a square the agent has never visited before gives a reward of 1.

  • Stepping on a "dead field" gives a reward of -5, and then the game is reset.

  • Moving onto a field that was already visited gives a reward of -1.

  • Going on the "win field" (there is exactly one) gives a reward of 5, and the game is reset as well.

Now I want an AI to learn to play that game via Q-Learning.

How I organized the inputs / feature engineering:

An input for the net is an array with the shape 1x4, where arr[0] represents the field above (when moving up), arr[1] represents the field to the right, arr[2] the one below, and arr[3] the one to the left.

Possible values the array can hold: 0, 1, 2, 3

0 = "dead field", so the worst case

1 = this would be outside of the 4x4 field (so you can't step there), or the field was already visited

2 = unvisited field (so that is something good)

3 = "win field", so the best case

As you see, I ordered them by their reward.

Since the game takes an input the same way (0 = move up, 1 = move to the right, 2 = move down, 3 = move to the left), the only thing the AI would have to learn is basically: choose the array index that holds the greatest value.
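In other words, and just as my own illustration rather than anything from the original post, the decision rule the net only has to approximate is a plain argmax over the four input values:

import numpy as np

# up = wall/visited (1), right = unvisited (2), down = dead (0), left = win (3)
state = np.array([1, 2, 0, 3])
best_action = int(np.argmax(state))   # picks the index with the largest value
print(best_action)                    # 3 -> move left, straight onto the win field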

But unfortunately it doesn't work, the net just doesn't learn, not even after 30,000 episodes.

Here's my code (including the game at the beginning):

import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt

from time import sleep

episoden = 0

felder = []
schon_besucht = []

playerx = 0
playery = 0

grafik = False

def gib_zustand():
    # special feature engineering:
    # the input consists of just one direction, one-hot encoded; i.e. 4 input neurons
    # (ember, wall/visited, unvisited, win)
    #
    # it is the direction that is to be evaluated (i.e. 1 output neuron for one direction)

    # returned here: array, shape: 4x4 (see above)

    global playerx
    global playery

    # up
    if playery == 0:
        oben = 1
    else:
        oben = felder[playery-1][playerx]

    # right
    if playerx == 4:
        rechts = 1
    else:
        rechts = felder[playery][playerx+1]

    # down
    if playery == 4:
        unten = 1
    else:
        unten = felder[playery+1][playerx]

    # left
    if playerx == 0:
        links = 1
    else:
        links = felder[playery][playerx-1]

    return np.array([oben, rechts, unten, links])

def grafisch():
    if grafik:

        # encoding:
        # ember = G, visited = b, unvisited = (space), win = S, player = X
        global felder
        global playerx
        global playery

        print('')

        for y in range(0,5):
            print('|', end='')
            for x in range(0,5):
                if felder[y][x] == 0:
                    temp = 'G'
                if felder[y][x] == 1:
                    temp = 'b'
                if felder[y][x] == 2:
                    temp = ' '
                if felder[y][x] == 3:
                    temp = 'S'
                if y == playery and x == playerx:
                    temp = 'X'

                print(temp, end='')
                print('|', end='')
            print('')

def reset():
    print('--- RESET ---')

    global playery
    global playerx
    global felder
    global schon_besucht

    playerx = 1
    playery = 3

    # layout
    # ember = 0, wall/visited = 1, unvisited = 2, win = 3

    felder = [[2 for x in range(0,5)] for y in range(0,5)]
    # place an ember (dead field) twice
    gl1 = random.randint(1,3)
    gl1_1 = random.randint(2,3) if gl1==3 else (random.randint(1,2) if gl1==1 else random.randint(1,3))
    felder[gl1][gl1_1] = 0 # ember

    # second time
    gl1 = random.randint(1,3)
    gl1_1 = random.randint(2,3) if gl1==3 else (random.randint(1,2) if gl1==1 else random.randint(1,3))
    felder[gl1][gl1_1] = 0 # ember

    # "pudding" = the win field
    felder[1][3] = 3

    # reset the visited list
    schon_besucht = []

    grafisch()

    return gib_zustand()

def step(zug):
    # 0 = up, 1 = right, 2 = down, 3 = left
    global playerx
    global playery
    global felder
    global schon_besucht

    if zug == 0:
        if playery != 0:
            playery -= 1
    if zug == 1:
        if playerx != 4:
            playerx += 1
    if zug == 2:
        if playery != 4:
            playery += 1
    if zug == 3:
        if playerx != 0:
            playerx -= 1

    # fetch the reward
    wert = felder[playery][playerx]

    if wert==0:
        belohnung = -5
    if wert==1:
        belohnung = -1
    if wert==2:
        belohnung = 1
    if wert==3:
        belohnung = 5

    # record the field if the game was not lost
    if belohnung != -5:
        schon_besucht.append((playery,playerx))
        felder[playery][playerx] = 1

    grafisch()

    return gib_zustand(), belohnung, belohnung==5, 0 # trailing 0 just so the tuple shape fits

episoden = 0

tf.reset_default_graph()

#These lines establish the feed-forward part of the network used to choose actions
inputs1 = tf.placeholder(shape=[1,4],dtype=tf.float32)
#W1 = tf.Variable(tf.random_uniform([16,8],0,0.01))
W2 = tf.Variable(tf.random_uniform([4,4],0,0.01))
#schicht2 = tf.matmul(inputs1,W1)
Qout = tf.matmul(inputs1,W2)
predict = tf.argmax(Qout,1)

#Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)

init = tf.initialize_all_variables()

# Set learning parameters
y = .99
e = 0.1
num_episodes = 10_000
#create lists to contain total rewards and steps per episode
jList = []
rList = []
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):             
        #Reset environment and get first new observation
        s = reset()
        rAll = 0
        d = False
        j = 0
        #The Q-Network        
        while j < 99:
            j+=1
            #Choose an action by greedily (with e chance of random action) from the Q-network
            a,allQ = sess.run([predict,Qout],feed_dict={inputs1:s.reshape(1,4)}) # computes the prediction for the input (the input appears to be one-hot encoded here)
            if np.random.rand(1) < e:
                a[0] = random.randint(0,3)                 

            #Get new state and reward from environment
            s1,r,d,_ = step(a[0])
            #Obtain the Q' values by feeding the new state through our network
            Q1 = sess.run(Qout,feed_dict={inputs1:s1.reshape(1,4)})
            #Obtain maxQ' and set our target value for chosen action.
            maxQ1 = np.max(Q1)


            targetQ = allQ
            targetQ[0,a[0]] = r + y*maxQ1
            #Train our network using target and predicted Q values

            _,W1 = sess.run([updateModel,W2],feed_dict={inputs1:s.reshape(1,4),nextQ:targetQ})
            rAll += r
            s = s1

            if r == -5 or r == 5:
                if r == 5:
                    episoden+=1

                reset()

                #Reduce chance of random action as we train the model.
                e = 1./((i/50) + 10)
                break
        jList.append(j)
        #print(rAll)
        rList.append(rAll)
print("Percent of successful episodes: " + str((episoden/num_episodes)*100) + "%")
plt.plot(rList)
plt.plot(jList)

I read in a similar question that a reason for too-high Q-values can be that it is in fact possible for the agent to get an unlimited total reward in a game. That would be the case here if the agent could step on already-visited fields and would get a reward of 1; then, of course, the possible total reward would be infinite. But that isn't the case here: the player gets a bad reward (-1) when he does that. A little calculation: the win field is rewarded with 5, an unvisited field with 1, there is at least one dead field, and there are 16 fields in all. Maximum possible total reward: 14*1 + 1*5 = 19
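A quick sanity check of that bound under the stated assumptions (16 fields, at least one dead field, exactly one win field):

unvisited = 16 - 1 - 1                     # all fields minus one dead field and the win field
max_total_reward = unvisited * 1 + 1 * 5   # 14 unvisited rewards plus the win reward
print(max_total_reward)                    # 19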

Answer

I finally found a solution for that; it took me almost a week.

The key was to have my inputs one-hot encoded. This gives me 16 input neurons instead of 4, but it works now. After 1,000 episodes I mostly get around 91% successful episodes.
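For reference, here is a minimal sketch of what such a one-hot encoding could look like; the helper name and exact shapes are my assumptions, not the author's actual code:

import numpy as np

def one_hot_zustand(zustand):
    # Hypothetical helper: turn the 4-direction state (values 0-3) into a
    # flat 1x16 one-hot vector, as described above.
    encoded = np.zeros((4, 4), dtype=np.float32)
    encoded[np.arange(4), zustand] = 1.0
    return encoded.reshape(1, 16)

# Example: up = visited (1), right = unvisited (2), down = dead (0), left = win (3)
print(one_hot_zustand(np.array([1, 2, 0, 3])))

With this representation, inputs1 would presumably get shape [1,16] and W2 shape [16,4]; those exact network changes are an assumption on my part.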

I'm still wondering about the fact that it didn't work when the input wasn't one-hot encoded. I know that an ANN will automatically use the greater/smaller relations between the different inputs a neuron takes, which can be a disadvantage. But since I ordered my inputs such that if one input is greater than another one, the output should also be greater in the same way, there is no disadvantage here if the ANN uses those relations; on the contrary, that should be an advantage.
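One possible explanation, offered as my own aside rather than as part of the original answer: with a single linear layer, each action's Q-value is a weighted sum of all four ordinal inputs, so the rule "take the direction with the largest input" is essentially only representable by something close to a scaled identity weight matrix, which gradient descent will not necessarily find from these rewards. A quick illustration:

import numpy as np

W = np.random.uniform(0, 0.01, (4, 4))         # mirrors tf.random_uniform([4,4],0,0.01)
s = np.array([[1, 2, 0, 3]], dtype=np.float32)
Q = s @ W                                      # same linear model as Qout above
print(np.argmax(Q), "vs ideal action", np.argmax(s))  # typically disagree before training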

Therefore I thought it would be good not to one-hot encode the inputs, because that way I reduce the dimensionality immensely (4 inputs instead of 16).

Apparently that idea doesn't work.

However, as I said, with 16 inputs it works now.

