How to use Tensorflow Optimizer without recomputing activations in reinforcement learning program that returns control after each iteration?


Question

EDIT (1/3/16): corresponding github issue

I'm using Tensorflow (Python interface) to implement a q-learning agent with function approximation trained using stochastic gradient descent.

At each iteration of the experiment, a step function in the agent is called that updates the parameters of the approximator based on the new reward and activation, and then chooses a new action to perform.

Here is the problem (with reinforcement learning jargon):

  • The agent computes its state-action value predictions to choose an action.
  • Then it gives control back to another program that simulates a step in the environment.
  • Now the agent's step function is called for the next iteration. I want to use Tensorflow's Optimizer class to compute the gradients for me. However, this requires both the state-action value predictions that I computed in the last step AND their graph. So:
    • If I run the optimizer on the whole graph, then it has to recompute the state-action value predictions.
    • But if I store the prediction (for the chosen action) as a variable and then feed it to the optimizer as a placeholder, it no longer has the graph necessary to compute the gradients (see the sketch after this list).
    • I can't just run it all in the same sess.run() statement, because I have to give up control and return the chosen action in order to get the next observation and reward (to use in the target for the loss function).
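
To make the placeholder point concrete, here is a minimal sketch in TF 1.x-style graph mode (the names W, pred, cached_pred and the shapes are made up for illustration): once the cached prediction re-enters the graph as a plain number through a placeholder, tf.gradients has no path from the loss back to the weights.

    import tensorflow as tf

    W = tf.Variable(tf.ones([3, 1]), name="weights")
    x = tf.placeholder(tf.float32, [1, 3], name="observation")
    pred = tf.matmul(x, W)                            # "value1", produced by the first run

    cached_pred = tf.placeholder(tf.float32, [1, 1])  # value1 fed back in as a plain number
    target = tf.placeholder(tf.float32, [1, 1])       # target built outside the graph

    loss_connected = tf.square(target - pred)         # still wired to W through pred
    loss_detached = tf.square(target - cached_pred)   # the wiring back to W is gone

    print(tf.gradients(loss_connected, [W]))          # [<gradient tensor>] -- differentiable
    print(tf.gradients(loss_detached, [W]))           # [None] -- nothing to backpropagate through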

So, is there a way that I can (without reinforcement learning jargon):

1. Compute part of the graph, returning value1.
2. Return value1 to the calling program so it can compute value2.
3. In the next iteration, use value2 as part of the loss function for gradient descent, WITHOUT recomputing the part of the graph that computes value1.

Of course, I've considered the obvious solutions:

1. Just hardcode the gradients: This would be easy for the really simple approximators I'm using now, but it would be really inconvenient if I were experimenting with different filters and activation functions in a big convolutional network. I'd really like to use the Optimizer class if possible.

2. Call the environment simulation from within the agent: This system does this, but it would make mine more complicated and remove a lot of the modularity and structure. So, I don't want to do this.

I've read through the API and whitepaper several times, but can't seem to come up with a solution. I was trying to come up with some way to feed the target into a graph to calculate the gradients, but couldn't come up with a way to build that graph automatically.

If it turns out this isn't possible in TensorFlow yet, do you think it would be very complicated to implement this as a new operator? (I haven't used C++ in a couple of years, so the TensorFlow source looks a little intimidating.) Or would I be better off switching to something like Torch, which has imperative differentiation via Autograd instead of symbolic differentiation?

Thanks for taking the time to help me out with this. I was trying to make this as concise as I could.

After doing some further searching I came across this previously asked question. It's a little different than mine (they are trying to avoid updating an LSTM network twice every iteration in Torch), and doesn't have any answers yet.

Here is some code if that helps:

      '''
      -Q-Learning agent for a grid-world environment.
      -Receives input as raw RGB pixel representation of the screen.
      -Uses an artificial neural network function approximator with one hidden layer
      
      2015 Jonathon Byrd
      '''
      
      import random
      import sys
      #import copy
      from rlglue.agent.Agent import Agent
      from rlglue.agent import AgentLoader as AgentLoader
      from rlglue.types import Action
      from rlglue.types import Observation
      
      import tensorflow as tf
      import numpy as np
      
      world_size = (3,3)
      total_spaces = world_size[0] * world_size[1]
      
      class simple_agent(Agent):
      
          #Constants
          discount_factor = tf.constant(0.5, name="discount_factor")
          learning_rate = tf.constant(0.01, name="learning_rate")
          exploration_rate = tf.Variable(0.2, name="exploration_rate")  # used to be a constant :P
          hidden_layer_size = 12
      
          #Network Parameters - weights and biases
          W = [tf.Variable(tf.truncated_normal([total_spaces * 3, hidden_layer_size], stddev=0.1), name="layer_1_weights"), 
          tf.Variable(tf.truncated_normal([hidden_layer_size,4], stddev=0.1), name="layer_2_weights")]
          b = [tf.Variable(tf.zeros([hidden_layer_size]), name="layer_1_biases"), tf.Variable(tf.zeros([4]), name="layer_2_biases")]
      
          #Input placeholders - observation and reward
          screen = tf.placeholder(tf.float32, shape=[1, total_spaces * 3], name="observation") #input pixel rgb values
          reward = tf.placeholder(tf.float32, shape=[], name="reward")
      
          #last step data
          last_obs = np.array([1, 2, 3], ndmin=4)
          last_act = -1
      
          #Last step placeholders
          last_screen = tf.placeholder(tf.float32, shape=[1, total_spaces * 3], name="previous_observation")
          last_move = tf.placeholder(tf.int32, shape = [], name="previous_action")
      
          next_prediction = tf.placeholder(tf.float32, shape = [], name="next_prediction")
      
          step_count = 0
      
          def __init__(self):
              #Initialize computational graphs
              self.q_preds = self.Q(self.screen)
              self.last_q_preds = self.Q(self.last_screen)
              self.action = self.choose_action(self.q_preds)
              self.next_pred = self.max_q(self.q_preds)
              self.last_pred = self.act_to_pred(self.last_move, self.last_q_preds) # inefficient recomputation
              self.loss = self.error(self.last_pred, self.reward, self.next_prediction)
              self.train = self.learn(self.loss)
              #Summaries and Statistics
              tf.scalar_summary(['loss'], self.loss)
              tf.scalar_summary('reward', self.reward)
              #w_hist = tf.histogram_summary("weights", self.W[0])
              self.summary_op = tf.merge_all_summaries()
              self.sess = tf.Session()
              self.summary_writer = tf.train.SummaryWriter('tensorlogs', graph_def=self.sess.graph_def)
      
      
          def agent_init(self,taskSpec):
              print("agent_init called")
              self.sess.run(tf.initialize_all_variables())
      
          def agent_start(self,observation):
              #print("agent_start called, observation = {0}".format(observation.intArray))
              o = np.divide(np.reshape(np.asarray(observation.intArray), (1,total_spaces * 3)), 255)
              return self.control(o)
      
          def agent_step(self,reward, observation):
              #print("agent_step called, observation = {0}".format(observation.intArray))
              print("step, reward: {0}".format(reward))
              o = np.divide(np.reshape(np.asarray(observation.intArray), (1,total_spaces * 3)), 255)
      
              next_prediction = self.sess.run([self.next_pred], feed_dict={self.screen:o})[0]
      
              if self.step_count % 10 == 0:
                  summary_str = self.sess.run([self.summary_op, self.train], 
                      feed_dict={self.reward:reward, self.last_screen:self.last_obs, 
                      self.last_move:self.last_act, self.next_prediction:next_prediction})[0]
      
                  self.summary_writer.add_summary(summary_str, global_step=self.step_count)
              else:
                  self.sess.run([self.train], 
                      feed_dict={self.screen:o, self.reward:reward, self.last_screen:self.last_obs, 
                      self.last_move:self.last_act, self.next_prediction:next_prediction})
      
              return self.control(o)
      
          def control(self, observation):
              results = self.sess.run([self.action], feed_dict={self.screen:observation})
              action = results[0]
      
              self.last_act = action
              self.last_obs = observation
      
              if (action==0):  # convert action integer to direction character
                  action = 'u'
              elif (action==1):
                  action = 'l'
              elif (action==2):
                  action = 'r'
              elif (action==3):
                  action = 'd'
              returnAction=Action()
              returnAction.charArray=[action]
              #print("return action returned {0}".format(action))
              self.step_count += 1
              return returnAction
      
          def Q(self, obs):  #calculates state-action value prediction with feed-forward neural net
              with tf.name_scope('network_inference') as scope:
                  h1 = tf.nn.relu(tf.matmul(obs, self.W[0]) + self.b[0])
                  q_preds = tf.matmul(h1, self.W[1]) + self.b[1] #linear activation
                  return tf.reshape(q_preds, shape=[4])
      
          def choose_action(self, q_preds):  #chooses action epsilon-greedily
              with tf.name_scope('action_choice') as scope:
                  exploration_roll = tf.random_uniform([])
                  #greedy_action = tf.argmax(q_preds, 0)  # gets the action with the highest predicted Q-value
                  #random_action = tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64)
      
                  #exploration rate updates
                  #if self.step_count % 10000 == 0:
                      #self.exploration_rate.assign(tf.div(self.exploration_rate, 2))
      
                  return tf.select(tf.greater_equal(exploration_roll, self.exploration_rate), 
                      tf.argmax(q_preds, 0),   #greedy_action
                      tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64))  #random_action
      
              '''
              Why does this return NoneType?:
      
              flag = tf.select(tf.greater_equal(exploration_roll, self.exploration_rate), 'g', 'r')
              if flag == 'g':  #greedy
                  return tf.argmax(q_preds, 0) # gets the action with the highest predicted Q-value
              elif flag == 'r':  #random
                  return tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64)
              '''
      
          def error(self, last_pred, r, next_pred):
              with tf.name_scope('loss_function') as scope:
                  y = tf.add(r, tf.mul(self.discount_factor, next_pred)) #target
                  return tf.square(tf.sub(y, last_pred)) #squared difference error
      
      
          def learn(self, loss): #Update parameters using stochastic gradient descent
              #TODO:  Either figure out how to avoid computing the q-prediction twice or just hardcode the gradients.
              with tf.name_scope('train') as scope:
                  return tf.train.GradientDescentOptimizer(self.learning_rate).minimize(loss, var_list=[self.W[0], self.W[1], self.b[0], self.b[1]])
      
      
          def max_q(self, q_preds):
              with tf.name_scope('greedy_estimate') as scope:
                  return tf.reduce_max(q_preds)  #best predicted action from current state
      
          def act_to_pred(self, a, preds): #get the value prediction for action a
              with tf.name_scope('get_prediction') as scope:
                  return tf.slice(preds, tf.reshape(a, shape=[1]), [1])
      
      
          def agent_end(self,reward):
              pass
      
          def agent_cleanup(self):
              self.sess.close()
              pass
      
          def agent_message(self,inMessage):
              if inMessage=="what is your name?":
                  return "my name is simple_agent";
              else:
                  return "I don't know how to respond to your message";
      
      if __name__=="__main__":
          AgentLoader.loadAgent(simple_agent())
      

Answer

Right now what you want to do is very difficult in Tensorflow (0.6). Your best bet is to bite the bullet and call run multiple times at the cost of recomputing the activations. However, we are very aware of this issue internally. A prototype "partial run" solution is in the works, but there is no timeline for its completion right now. Since a truly satisfactory answer might require modifying tensorflow itself, you could also make a github issue for this and see if anyone else has anything to say on this there.
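
As a point of reference, below is a minimal self-contained sketch of that "call run multiple times" pattern (TF 1.x-style graph mode; the tiny linear approximator and all names are stand-ins, not the question's actual network). The previous observation is kept on the Python side and fed again at training time, so last step's activations are simply recomputed:

    import numpy as np
    import tensorflow as tf

    # Tiny stand-in Q approximator: 3-dim observation, 4 actions.
    W = tf.Variable(tf.zeros([3, 4]))
    screen = tf.placeholder(tf.float32, [1, 3])
    last_screen = tf.placeholder(tf.float32, [1, 3])
    reward = tf.placeholder(tf.float32, [])
    next_prediction = tf.placeholder(tf.float32, [])

    q = tf.matmul(screen, W)
    action = tf.argmax(q, 1)[0]
    next_pred = tf.reduce_max(q)

    last_q = tf.matmul(last_screen, W)     # last step's activations, recomputed
    last_pred = tf.reduce_max(last_q)      # stand-in for indexing by the chosen action
    loss = tf.square(reward + 0.5 * next_prediction - last_pred)
    train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    obs = np.ones([1, 3], dtype=np.float32)
    act = sess.run(action, feed_dict={screen: obs})   # run 1: choose an action
    last_obs = obs                                    # remembered in Python

    # ... control returns to the environment, which yields (r, next_obs) ...
    r, next_obs = 1.0, np.ones([1, 3], dtype=np.float32)

    # run 2: re-feed last_obs; the old activations are recomputed, not cached.
    next_q = sess.run(next_pred, feed_dict={screen: next_obs})
    sess.run(train, feed_dict={reward: r, last_screen: last_obs,
                               next_prediction: next_q})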

Experimental support for partial_run is now in: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/client/session.py#L317
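
For reference, here is a minimal sketch of how that experimental API is used in later releases, following the example in the linked session.py (the tensors a, b, c, r1, r2 are illustrative only):

    import tensorflow as tf

    a = tf.placeholder(tf.float32, shape=[])
    b = tf.placeholder(tf.float32, shape=[])
    c = tf.placeholder(tf.float32, shape=[])
    r1 = a + b        # first stage ("value1")
    r2 = r1 * c       # second stage reuses r1 without recomputing it

    sess = tf.Session()
    # Declare up front everything that will be fetched or fed through this handle.
    h = sess.partial_run_setup([r1, r2], [a, b, c])

    # First call: compute r1, then hand control back to the caller.
    value1 = sess.partial_run(h, r1, feed_dict={a: 1.0, b: 2.0})

    # Later call: feed the externally computed value and finish the graph;
    # r1 is taken from the cached partial execution rather than recomputed.
    result = sess.partial_run(h, r2, feed_dict={c: 10.0})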

