State dependent action set in reinforcement learning


Problem Description

How do people deal with problems where the legal actions differ from state to state? In my case I have about 10 actions in total and the legal action sets do not overlap, meaning that in certain states the same 3 actions are always legal, and those actions are never legal in the other types of states.

I'm also interested in seeing whether the solutions would be different if the legal actions were overlapping.

For Q-learning (where my network gives me the values for state/action pairs), I was thinking maybe I could just be careful about which Q-value to choose when constructing the target value (i.e., instead of choosing the max over all actions, I choose the max among the legal actions...).
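A minimal sketch of that idea, assuming a PyTorch-style Q-network and a boolean legality mask; all names below are illustrative, not from the original post:

```python
import torch

def masked_q_target(q_next, legal_mask, reward, done, gamma=0.99):
    # q_next:     (batch, n_actions) Q-values of the next states
    # legal_mask: (batch, n_actions) bool, True where an action is legal
    # reward, done: (batch,) tensors
    # Push illegal actions to -inf so the max can never pick them.
    masked_q = q_next.masked_fill(~legal_mask, float("-inf"))
    max_q, _ = masked_q.max(dim=1)
    # Standard one-step Bellman target, restricted to legal actions.
    return reward + gamma * (1.0 - done.float()) * max_q
```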

For policy-gradient-type methods I'm less sure what the appropriate setup is. Is it okay to just mask the output layer when computing the loss?

Recommended Answer

Currently this problem does not seem to have a single, universal, straightforward answer. Maybe that is because it is not much of an issue in practice?

Your suggestion of choosing the best Q-value among the legal actions is actually one of the proposed ways to handle this. For policy-gradient methods you can achieve a similar result by masking the illegal actions and properly scaling up the probabilities of the other actions.
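A minimal sketch of that masking-and-rescaling step for a policy network, again assuming PyTorch-style logits; the names are illustrative:

```python
import torch
import torch.nn.functional as F

def masked_policy(logits, legal_mask):
    # Setting illegal logits to -inf gives those actions probability 0 after
    # the softmax and renormalizes the legal ones automatically.
    masked_logits = logits.masked_fill(~legal_mask, float("-inf"))
    probs = F.softmax(masked_logits, dim=-1)          # illegal entries are 0
    log_probs = F.log_softmax(masked_logits, dim=-1)  # use these in the PG loss
    return probs, log_probs

# Typical REINFORCE-style loss for sampled (legal) actions:
# loss = -(log_probs.gather(1, actions.unsqueeze(1)).squeeze(1) * returns).mean()
```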

Another approach is to give a negative reward for choosing an illegal action, or to ignore the choice, making no change in the environment and returning the same reward as before. In one of my own experiments (a Q-learning method) I chose the latter; the agent learned what it had to learn, but it used illegal actions as a 'no-op' action from time to time. That wasn't really a problem for me, but a negative reward would probably eliminate this behaviour.
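A minimal sketch of those two environment-side strategies, assuming a gym-style environment; the legal_actions method is an assumption for illustration, not part of any standard API:

```python
class IllegalActionWrapper:
    """Handles illegal actions either by returning a negative reward
    or by treating the action as a no-op that leaves the state unchanged."""

    def __init__(self, env, illegal_penalty=None):
        self.env = env
        self.illegal_penalty = illegal_penalty  # e.g. -1.0, or None for "no-op"
        self.state = None

    def reset(self):
        self.state = self.env.reset()
        return self.state

    def step(self, action):
        if action not in self.env.legal_actions(self.state):  # assumed method
            if self.illegal_penalty is not None:
                # Punish the choice but do not advance the environment.
                return self.state, self.illegal_penalty, False, {}
            # Ignore the choice: same state, zero reward, episode continues.
            return self.state, 0.0, False, {}
        self.state, reward, done, info = self.env.step(action)
        return self.state, reward, done, info
```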

As you can see, these solutions don't change or differ when the actions are 'overlapping'.

Answering what you asked in the comments: I don't believe you can train the agent under the described conditions without it learning the legal/illegal action rules. That would require, for example, something like a separate network for each set of legal actions, which doesn't sound like the best idea (especially if there are many possible sets of legal actions).

But is learning these rules hard?

You have to answer some questions yourself: is the condition that makes an action illegal hard to express/articulate? It is, of course, environment-specific, but I would say it is usually not that hard to express, and agents simply learn these rules during training. If it is hard, does your environment provide enough information about the state?

