What is a policy in reinforcement learning?


Question

I have seen the following statement:

A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.

But I still don't fully understand it. What exactly is a policy in reinforcement learning?

Answer

The definition is correct, though it may not be immediately obvious if you are seeing it for the first time. Let me put it this way: a policy is the agent's strategy.

For example, imagine a world where a robot moves across the room and the task is to get to the target point (x, y), where it gets a reward. Here:

  • The room is the environment
  • The robot's current position is the state
  • A policy is what the agent does to accomplish this task:

  • Dumb robots just wander around randomly until they accidentally end up in the right place (policy 1)
  • Others may, for some reason, learn to go along the walls most of the route (policy 2)
  • Smart robots plan the route in their "head" and go straight to the goal (policy 3)
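The contrast between these policies can be sketched in a few lines of Python. This is a hypothetical toy example, not from the original answer: a 4×4 room with a goal cell, a random policy (policy 1), and a "plan straight to the goal" policy (policy 3).

```python
import random

# Hypothetical 4x4 room: states are (x, y) cells, the goal cell gives the reward.
GOAL = (3, 3)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # up, down, right, left

def random_policy(state):
    """Policy 1: wander randomly, ignoring the state entirely."""
    return random.choice(ACTIONS)

def greedy_policy(state):
    """Policy 3: head straight for the goal (assumes the robot 'knows' the map)."""
    x, y = state
    gx, gy = GOAL
    if x != gx:
        return (1, 0) if gx > x else (-1, 0)
    return (0, 1) if gy > y else (0, -1)

def run(policy, start=(0, 0), max_steps=1000):
    """Count steps until the goal is reached (or give up after max_steps)."""
    state, steps = start, 0
    while state != GOAL and steps < max_steps:
        dx, dy = policy(state)
        # Clip to the 4x4 room so the robot can't walk through walls.
        state = (min(3, max(0, state[0] + dx)), min(3, max(0, state[1] + dy)))
        steps += 1
    return steps

print("greedy policy:", run(greedy_policy), "steps")  # 6 steps from (0, 0)
print("random policy:", run(random_policy), "steps")  # varies, usually far more
```

Both functions map a state to an action, so both are policies; they just differ wildly in quality.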

Obviously, some policies are better than others, and there are multiple ways to assess them, namely the state-value function and the action-value function. The goal of RL is to learn the best policy. Now the definition should make more sense (note that in this context, time is better understood as a state):

A policy defines the learning agent's way of behaving at a given time.

More formally, we should first define a Markov Decision Process (MDP) as a tuple (S, A, P, R, γ), where:

  • S is a finite set of states
  • A is a finite set of actions
  • P is a state transition probability matrix (the probability of ending up in each state, for each current state and each action)
  • R is a reward function, given a state and an action
  • γ is a discount factor, between 0 and 1
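A tiny MDP can be written out explicitly to make the tuple concrete. The two states, actions, and numbers below are hypothetical, chosen only to illustrate the shape of (S, A, P, R, γ):

```python
# Hypothetical 2-state MDP; P[s][a] maps each next state to its probability.
S = ["s0", "s1"]                 # finite set of states
A = ["stay", "go"]               # finite set of actions
P = {                            # state transition probabilities
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {                            # reward for taking action a in state s
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 0.0, "go": 0.0},
}
gamma = 0.9                      # discount factor, between 0 and 1

# Sanity check: every transition distribution must sum to 1.
for s in S:
    for a in A:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```

Note that P and R depend on both the current state and the chosen action, exactly as in the bullet points above.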

Then, a policy π is a probability distribution over actions given states. That is, it gives the likelihood of every action when the agent is in a particular state (of course, I'm skipping a lot of details here). This definition corresponds to the second part of your quoted definition.
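A stochastic policy of this kind can be sketched as a lookup table from states to action distributions, plus a sampling step. The state names and probabilities here are hypothetical:

```python
import random

# Hypothetical stochastic policy pi(a|s): for each state, a distribution over actions.
pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state, rng=random.random):
    """Sample an action with probability pi(a|s), via inverse-CDF sampling."""
    r, cumulative = rng(), 0.0
    for action, prob in policy[state].items():
        cumulative += prob
        if r < cumulative:
            return action
    return action  # guard against floating-point rounding at the boundary

action = sample_action(pi, "s0")  # "left" 20% of the time, "right" 80%
```

A deterministic policy is just the special case where one action per state has probability 1.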

I highly recommend David Silver's RL course available on YouTube. The first two lectures focus particularly on MDPs and policies.
