Neural Network and Temporal Difference Learning

Problem Description

I have read a few papers and lectures on temporal difference learning (some as they pertain to neural nets, such as the Sutton tutorial on TD-Gammon), but I am having a difficult time understanding the equations, which leads me to my questions.

- Where does the prediction value V_t come from? And subsequently, how do we get V_(t+1)?

- What exactly is getting backpropagated when TD is used with a neural net? That is, where does the error that gets backpropagated come from when using TD?

Solution

The backward and forward views can be confusing, but when you are dealing with something simple like a game-playing program, things are actually pretty straightforward in practice. I'm not looking at the reference you're using, so let me just provide a general overview.

Suppose I have a function approximator like a neural network, and that it has two functions, train and predict, for training toward a particular output and for predicting the outcome of a state. (Or the outcome of taking an action in a given state.)
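For concreteness, here is a minimal Python sketch of such an interface; the ValueApproximator name and the linear model inside it are just illustrative stand-ins for whatever network you actually use:

import numpy as np

class ValueApproximator:
    # A tiny linear stand-in for a neural network, exposing train/predict.
    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)   # weight vector over state features
        self.lr = lr                    # learning rate

    def predict(self, state):
        # state is a feature vector; return the estimated value V(state)
        return float(np.dot(self.w, state))

    def train(self, state, target):
        # one gradient step on the squared error between prediction and target
        error = target - self.predict(state)
        self.w += self.lr * error * np.asarray(state)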

Suppose I have a trace of play from a game, where I used the predict method to tell me what move to make at each point, and suppose that I lose at the end of the game (V = 0). Suppose my states are s_1, s_2, s_3, ..., s_n.

The Monte Carlo approach says that I train my function approximator (e.g., my neural network) on each of the states in the trace using the trace and the final score. So, given this trace, you would make calls like:

train(s_n, 0)
train(s_n-1, 0)
...
train(s_1, 0)

That is, I'm asking every state to predict the final outcome of the trace.
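As a sketch (assuming trace is the list of state feature vectors s_1 ... s_n and final_outcome is the end-of-game score, 0 here), the Monte Carlo pass could be written as:

def train_monte_carlo(approx, trace, final_outcome):
    # Every state in the trace is trained toward the same final outcome.
    for s in reversed(trace):   # order does not matter here
        approx.train(s, final_outcome)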

The dynamic programming approach says that I train based on the result of the next state. So my training would be something like:

train(s_n, 0)
train(s_n-1, predict(s_n))
...
train(s_1, predict(s_2))

That is, I'm asking the function approximator to predict what the next state predicts, where the last state predicts the final outcome from the trace.
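A corresponding sketch, under the same assumptions as above:

def train_bootstrap(approx, trace, final_outcome):
    # Each state is trained toward the prediction for the state that followed it;
    # the last state is trained toward the final outcome.
    target = final_outcome
    for s in reversed(trace):            # s_n, s_n-1, ..., s_1
        next_target = approx.predict(s)  # used as the target for the preceding state
        approx.train(s, target)
        target = next_target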

TD learning mixes between these two, where λ=1 corresponds to the first case (Monte Carlo) and λ=0 corresponds to the second case (dynamic programming). Suppose that we use λ=0.5. Then our training would be:

train(s_n, 0)
train(s_n-1, 0.5*0 + 0.5*predict(s_n))
train(s_n-2, 0.25*0 + 0.25*predict(s_n) + 0.5*predict(s_n-1))
...

Now, what I've written here isn't completely correct, because you don't actually re-evaluate the approximator for every earlier state at each step. Instead, you just start with a target value (V = 0 in our example) and update it as you walk backwards through the trace, folding in each state's prediction before training the previous state: V = λ·V + (1-λ)·predict(s_i).
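Putting that together, a sketch of the whole backward pass (same assumed train/predict interface as above):

def train_td_lambda(approx, trace, final_outcome, lam=0.5):
    # Walk the trace backwards, carrying a running target V that starts at the
    # final outcome and blends in each state's own prediction.
    V = final_outcome                   # V = 0 in the losing-game example
    for s in reversed(trace):           # s_n, s_n-1, ..., s_1
        pred = approx.predict(s)        # prediction before this state is trained
        approx.train(s, V)
        V = lam * V + (1 - lam) * pred  # target for the state that preceded s

For example, reusing the hypothetical ValueApproximator sketch from above:

trace = [np.random.rand(8) for _ in range(20)]   # s_1 ... s_20 as 8-dimensional feature vectors
approx = ValueApproximator(n_features=8)
train_td_lambda(approx, trace, final_outcome=0.0, lam=0.5)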

This learns much faster than the Monte Carlo and dynamic programming methods, because you aren't asking the algorithm to learn such extreme targets (ignoring the current prediction entirely, or ignoring the final outcome entirely).
