What is a policy in reinforcement learning?
Question
I have seen the following:
A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.
But still didn't fully understand. What exactly is a policy in reinforcement learning?
Accepted answer
The definition is correct, though not instantly obvious if you see it for the first time. Let me put it this way: a policy is an agent's strategy.
For example, imagine a world where a robot moves across the room and the task is to get to the target point (x, y), where it gets a reward. Here:
- A room is an environment
- Robot's current position is a state
A policy is what an agent does to accomplish this task:
- Dumb robots just wander around randomly until they accidentally end up in the right place (policy 1)
- Others may, for some reason, learn to go along the walls for most of the route (policy 2)
- Smart robots plan the route in their "head" and go straight to the goal (policy 3)
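The idea that a policy is "what the agent does" can be sketched as an ordinary function from state to action. Below is a minimal, hypothetical grid world (the target coordinates, action set, and helper names are illustrative, not from the original answer) comparing policy 1 and policy 3:

```python
import random

# Four moves on a grid: right, left, up, down (illustrative action set).
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def random_policy(state):
    # Policy 1: wander randomly, ignoring the state entirely.
    return random.choice(ACTIONS)

def greedy_policy(state, target=(3, 3)):
    # Policy 3: head straight toward the target, one axis at a time.
    x, y = state
    tx, ty = target
    if x != tx:
        return (1, 0) if tx > x else (-1, 0)
    return (0, 1) if ty > y else (0, -1)

def run(policy, start=(0, 0), target=(3, 3), max_steps=1000):
    # Follow the policy until the target is reached (or we give up).
    state, steps = start, 0
    while state != target and steps < max_steps:
        dx, dy = policy(state)
        state = (state[0] + dx, state[1] + dy)
        steps += 1
    return steps
```

Running `run(greedy_policy)` takes exactly 6 steps from (0, 0) to (3, 3), while `run(random_policy)` usually takes far longer; both are valid policies, just not equally good ones.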
Obviously, some policies are better than others, and there are multiple ways to assess them, namely the state-value function and the action-value function. The goal of RL is to learn the best policy. Now the definition should make more sense (note that in this context, time is better understood as a state):
A policy defines the learning agent's way of behaving at a given time.
More formally, we should first define a Markov Decision Process (MDP) as a tuple (S, A, P, R, γ), where:
- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix (the probability of ending up in each next state, for each current state and each action)
- R is a reward function, given a state and an action
- γ is a discount factor, between 0 and 1
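An MDP can be written down as plain data. Here is a minimal sketch of a hypothetical two-state MDP (the state names, actions, and reward values are made up for illustration):

```python
# Finite sets of states and actions.
S = ["s0", "s1"]
A = ["stay", "move"]

# Discount factor gamma, between 0 and 1.
gamma = 0.9

# P[s][a] is a probability distribution over next states
# (here the transitions happen to be deterministic).
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 1.0}},
}

# R[s][a] is the reward for taking action a in state s.
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.0, "move": 0.0},
}
```

Each row of P must sum to 1, since it is a distribution over next states.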
Then, a policy π is a probability distribution over actions given states. That is, the likelihood of every action when an agent is in a particular state (of course, I'm skipping a lot of details here). This definition corresponds to the second part of your definition.
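A stochastic policy π(a | s) can be sketched as a table of per-state action probabilities, sampled with the standard library (the state and action names below are hypothetical):

```python
import random

# pi[s] is a probability distribution over actions in state s.
pi = {
    "s0": {"stay": 0.2, "move": 0.8},
    "s1": {"stay": 0.9, "move": 0.1},
}

def sample_action(policy, state, rng=random):
    # Draw one action according to pi(a | state).
    actions = list(policy[state])
    weights = list(policy[state].values())
    return rng.choices(actions, weights=weights, k=1)[0]
```

A deterministic policy is just the special case where, in every state, one action has probability 1.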
I highly recommend David Silver's RL course available on YouTube. The first two lectures focus particularly on MDPs and policies.