Can tf.agent policy return probability vector for all actions?

Problem description

I am trying to train a reinforcement learning agent following the TF-Agents DQN tutorial. In my application, I have 1 action with 9 possible discrete values (labeled 0 to 8). Below is the output of env.action_spec():

BoundedTensorSpec(shape=(), dtype=tf.int64, name='action', minimum=array(0, dtype=int64), maximum=array(8, dtype=int64))

I would like to get a probability vector containing the probabilities of all actions as computed by the trained policy, and do further processing with it in other application environments. However, the policy only returns a log_probability with a single value rather than a vector over all actions. Is there any way to get the probability vector?

import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.networks import q_network
from tf_agents.policies import policy_saver
from tf_agents.utils import common

epsilon = 0.1  # exploration rate; the actual value is not given in the question
global_step = tf.compat.v1.train.get_or_create_global_step()

q_net = q_network.QNetwork(
            env.observation_spec(),
            env.action_spec(),
            fc_layer_params=(32,)
        )

optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=0.001)

my_agent = dqn_agent.DqnAgent(
    env.time_step_spec(),
    env.action_spec(),
    q_network=q_net,
    epsilon_greedy=epsilon,
    optimizer=optimizer,
    emit_log_probability=True,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=global_step)

my_agent.initialize()

...  # training

tf_policy_saver = policy_saver.PolicySaver(my_agent.policy)
tf_policy_saver.save('./policy_dir/')

# making decision using the trained policy
action_step = my_agent.policy.action(time_step)

In dqn_agent.DqnAgent(), I set emit_log_probability=True, which is supposed to control whether the policy emits log probabilities.

However, when I run action_step = my_agent.policy.action(time_step), it returns

PolicyStep(action=<tf.Tensor: shape=(1,), dtype=int64, numpy=array([1], dtype=int64)>, state=(), info=PolicyInfo(log_probability=<tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.], dtype=float32)>))

I also tried running action_distribution = saved_policy.distribution(time_step), which returns

PolicyStep(action=<tfp.distributions.DeterministicWithLogProb 'Deterministic' batch_shape=[1] event_shape=[] dtype=int64>, state=(), info=PolicyInfo(log_probability=<tf.Tensor: shape=(), dtype=float32, numpy=0.0>))

If there is no such API available in TF.Agent, is there a way to get such probability vector? Thanks.

Follow-up question:

If I understand correctly, a deep Q-network takes the state as input and outputs a Q-value for each action available from that state. I could pass this Q-value vector through a softmax function and compute the corresponding probability vector. I have actually done this calculation with my own customized DQN script (without TF-Agents). The question then becomes: how do I get the Q-value vector out of TF-Agents?
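The softmax step described above can be written out as a short sketch; the Q-values here are made up purely for illustration:

```python
import numpy as np

def softmax(q_values):
    """Map a Q-value vector to a probability vector."""
    shifted = np.asarray(q_values, dtype=float)
    shifted -= shifted.max()  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# 9 made-up Q-values, one per discrete action 0..8
q = [1.2, 0.3, -0.5, 2.0, 0.0, 1.1, -1.0, 0.7, 0.4]
probs = softmax(q)

print(len(probs), round(probs.sum(), 6))  # 9 1.0
```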

Answer

The only way to do this in the TF-Agents framework is to invoke the Policy.distribution() method instead of the action() method. This returns the raw distribution computed from the network's Q-values. emit_log_probability=True only affects the info attribute of the PolicyStep namedtuple that Policy.action() returns. Note that this distribution may be affected by any action constraints you pass in, in which case illegal actions are marked with probability 0 (even though their raw Q-values might have been high).

Furthermore, if you would like to see the actual Q-values instead of the distribution they generate, then I'm afraid there is no way of doing this without acting directly on the Q-network that comes with your agent (which is also attached to the Policy object the agent produces). If you want to see how to call that Q-network properly, I recommend looking at how the QPolicy._distribution() method does it here.
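A minimal runnable sketch of that approach is below. In TF-Agents the actual call would be against the agent's Q-network (e.g. the internal my_agent._q_network attribute, which returns a (q_values, network_state) tuple and may change between versions); here fake Q-values stand in for the network output so the snippet runs standalone:

```python
import numpy as np

# In TF-Agents the call would look roughly like (internal attribute,
# signature may vary across versions):
#   q_values, _ = my_agent._q_network(time_step.observation,
#                                     time_step.step_type)
# Here we fake the network output so the snippet is self-contained.
rng = np.random.default_rng(42)
q_values = rng.normal(size=(1, 9))  # shape [batch_size, num_actions]

# Softmax over the action axis turns Q-values into a probability vector.
shifted = q_values - q_values.max(axis=-1, keepdims=True)
exp = np.exp(shifted)
action_probs = exp / exp.sum(axis=-1, keepdims=True)

print(action_probs.shape)  # (1, 9)
```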

Note that none of this can be done using the pre-implemented Drivers. You would have to either explicitly construct your own collection loop or implement your own Driver object (which is basically equivalent).
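The custom collection loop mentioned above can be sketched as follows. Everything here (ToyEnv, q_network) is hypothetical scaffolding so the control flow runs standalone; in a real setup these would be your TF-Agents environment and the agent's Q-network:

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyEnv:
    """Toy episodic environment standing in for a TF-Agents environment."""
    def __init__(self, horizon=5):
        self.horizon, self.t = horizon, 0
    def reset(self):
        self.t = 0
        return np.zeros(4)          # dummy observation
    def step(self, action):
        self.t += 1
        return np.zeros(4)
    def is_last(self):
        return self.t >= self.horizon

def q_network(observation):
    """Stand-in for the agent's QNetwork: one Q-value per action."""
    return rng.normal(size=9)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hand-written collection loop: query all Q-values before each step,
# keep the full probability vector, then act greedily like a DQN policy.
env = ToyEnv()
obs = env.reset()
collected_probs = []
while not env.is_last():
    probs = softmax(q_network(obs))   # probability vector over all actions
    action = int(np.argmax(probs))    # greedy action
    collected_probs.append(probs)
    obs = env.step(action)

print(len(collected_probs))  # 5
```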
