TF Agents: How to feed faked observations into a trained deep Q network model to examine which actions it chooses?

Problem description

All descriptions of links referenced in the question below are as of 2021/05/31.

I have trained a deep Q network on a custom problem, following a version of the TF Agents tutorial. Now I would like to feed it some hand-crafted observations to see which actions it recommends. I have utility functions for creating these feature vectors, which I use in my PyEnvironment. However, I am not sure how to convert these pieces into inputs for the network.

What I would like to do is the following:

  1. Start from the initial state and see which action the network recommends.
  2. Manually change the state and see what the network recommends next.
  3. And so on...

My environment has a stochastic component, so I want to manually modify the environment state rather than have the agent explicitly take a path through the environment.

To make progress on this question, I have been examining this tutorial on policies. It looks like my use case might be similar to the "Random TF Policy" section, or the "Actor policies" section below it. However, in my use case I have a loaded agent and Python (non-TF) observations, time step specs, and action specs. What is the ideal approach to drive my network to produce actions from these components?

Here is what I have tried:

import tensorflow as tf
from tf_agents.trajectories import time_step as ts

saved_policy = tf.compat.v2.saved_model.load(policy_dir)
# get_feat_vector returns a numpy.ndarray
observation = tf.convert_to_tensor(state.get_feat_vector(), dtype=tf.float32)
time_step = ts.restart(observation)          # unbatched TimeStep
action_step = saved_policy.action(time_step)

And the relevant error message:

File "/home/---/.local/lib/python3.8/site-packages/tensorflow/python/saved_model/function_deserialization.py", line 267, in restored_function_body
    raise ValueError(
ValueError: Could not find matching function to call loaded from the SavedModel. Got:
  Positional arguments (2 total):
    * TimeStep(step_type=<tf.Tensor 'time_step:0' shape=() dtype=int32>, reward=<tf.Tensor 'time_step_1:0' shape=() dtype=float32>, discount=<tf.Tensor 'time_step_2:0' shape=() dtype=float32>, observation=<tf.Tensor 'time_step_3:0' shape=(170,) dtype=float32>)
    * ()
  Keyword arguments: {}

Expected these arguments to match one of the following 2 option(s):

Option 1:
  Positional arguments (2 total):
    * TimeStep(step_type=TensorSpec(shape=(None,), dtype=tf.int32, name='step_type'), reward=TensorSpec(shape=(None,), dtype=tf.float32, name='reward'), discount=TensorSpec(shape=(None,), dtype=tf.float32, name='discount'), observation=TensorSpec(shape=(None, 170), dtype=tf.float32, name='observation'))
    * ()
  Keyword arguments: {}

Option 2:
  Positional arguments (2 total):
    * TimeStep(step_type=TensorSpec(shape=(None,), dtype=tf.int32, name='time_step/step_type'), reward=TensorSpec(shape=(None,), dtype=tf.float32, name='time_step/reward'), discount=TensorSpec(shape=(None,), dtype=tf.float32, name='time_step/discount'), observation=TensorSpec(shape=(None, 170), dtype=tf.float32, name='time_step/observation'))
    * ()
  Keyword arguments: {}
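
For reference, the "Got" and "Expected" signatures in this trace differ only in the leading batch dimension: the hand-built time step is unbatched (scalar step_type/reward/discount, observation of shape (170,)), while the SavedModel expects batched inputs (shapes (None,) and (None, 170)). A minimal, untested sketch of adding that batch dimension, assuming this is the only mismatch, could look like:

import tensorflow as tf
from tf_agents.trajectories import time_step as ts

observation = tf.convert_to_tensor(state.get_feat_vector(), dtype=tf.float32)
# Add an outer batch dimension so the observation has shape (1, 170).
batched_observation = tf.expand_dims(observation, axis=0)
# batch_size=1 also makes step_type, reward and discount batched.
time_step = ts.restart(batched_observation, batch_size=1)
action_step = saved_policy.action(time_step)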

Answer

I believe your problem might be with how you are loading and saving the model. TF-Agents recommends using the PolicySaver (see here). So maybe try running code like:

from tf_agents.policies import policy_saver

tf_agent = ...
# Save the agent's policy (not the whole agent) with PolicySaver.
tf_policy_saver = policy_saver.PolicySaver(policy=tf_agent.policy)

... # train agent

tf_policy_saver.save(export_dir=policy_dir_path)

Then load and run the model:

from tf_agents.policies import py_tf_eager_policy

# Load the SavedModel as a Python policy using the environment's specs.
eager_py_policy = py_tf_eager_policy.SavedModelPyTFEagerPolicy(
    policy_dir, env.time_step_spec(), env.action_spec())

policy_state = eager_py_policy.get_initial_state(1)
time_step = env.reset()
action_step = eager_py_policy.action(time_step, policy_state)
time_step = env.step(action_step.action)
policy_state = action_step.state

Or whatever manual thing you want to do with the environment and observations.
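
To probe the network with hand-crafted states rather than stepping the environment, one possible sketch (assuming get_feat_vector() builds the same 170-dimensional feature vector the policy was trained on, and that the loaded Python policy batches incoming time steps itself) is:

import numpy as np
from tf_agents.trajectories import time_step as ts

# Hand-craft a state and wrap it in an unbatched Python TimeStep.
observation = np.asarray(state.get_feat_vector(), dtype=np.float32)
time_step = ts.restart(observation)

policy_state = eager_py_policy.get_initial_state(batch_size=1)
action_step = eager_py_policy.action(time_step, policy_state)
print(action_step.action)  # the action the network recommends for this state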
