What is a policy in reinforcement learning?
Question
I have seen the following:
A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.
But still didn't fully understand. What exactly is a policy in reinforcement learning?
Accepted answer
The definition is correct, though not instantly obvious if you see it for the first time. Let me put it this way: a policy is an agent's strategy.
For example, imagine a world where a robot moves across the room and the task is to get to the target point (x, y), where it gets a reward. Here:
- A room is an environment
- Robot's current position is a state
A policy is what an agent does to accomplish this task:
- Dumb robots just wander around randomly until they accidentally end up in the right place (policy 1)
- Others may, for some reason, learn to go along the walls for most of the route (policy 2)
- Smart robots plan the route in their "head" and go straight to the goal (policy 3)
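The idea that a policy is "what the agent does" can be sketched as an ordinary function from state to action. Below is a minimal, hypothetical grid world (the target coordinates, action set, and helper names are illustrative, not from the original answer) comparing policy 1 and policy 3:

```python
import random

# Four moves on a grid: right, left, up, down (illustrative action set).
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def random_policy(state):
    # Policy 1: wander randomly, ignoring the state entirely.
    return random.choice(ACTIONS)

def greedy_policy(state, target=(3, 3)):
    # Policy 3: head straight toward the target, one axis at a time.
    x, y = state
    tx, ty = target
    if x != tx:
        return (1, 0) if tx > x else (-1, 0)
    return (0, 1) if ty > y else (0, -1)

def run(policy, start=(0, 0), target=(3, 3), max_steps=1000):
    # Follow the policy until the target is reached (or we give up).
    state, steps = start, 0
    while state != target and steps < max_steps:
        dx, dy = policy(state)
        state = (state[0] + dx, state[1] + dy)
        steps += 1
    return steps
```

Running `run(greedy_policy)` takes exactly 6 steps from (0, 0) to (3, 3), while `run(random_policy)` usually takes far longer; both are valid policies, just not equally good ones.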
Obviously, some policies are better than others, and there are multiple ways to assess them, namely the state-value function and the action-value function. The goal of RL is to learn the best policy. Now the definition should make more sense (note that in this context, time is better understood as a state):
A policy defines the learning agent's way of behaving at a given time.
More formally, we should first define a Markov Decision Process (MDP) as a tuple (S, A, P, R, γ), where:
- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix (the probability of ending up in each next state, for each current state and each action)
- R is a reward function, given a state and an action
- γ is a discount factor, between 0 and 1
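An MDP can be written down as plain data. Here is a minimal sketch of a hypothetical two-state MDP (the state names, actions, and reward values are made up for illustration):

```python
# Finite sets of states and actions.
S = ["s0", "s1"]
A = ["stay", "move"]

# Discount factor gamma, between 0 and 1.
gamma = 0.9

# P[s][a] is a probability distribution over next states
# (here the transitions happen to be deterministic).
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 1.0}},
}

# R[s][a] is the reward for taking action a in state s.
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.0, "move": 0.0},
}
```

Each row of P must sum to 1, since it is a distribution over next states.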
Then, a policy π is a probability distribution over actions given states. That is, the likelihood of every action when an agent is in a particular state (of course, I'm skipping a lot of details here). This definition corresponds to the second part of your definition.
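A stochastic policy π(a | s) can be sketched as a table of per-state action probabilities, sampled with the standard library (the state and action names below are hypothetical):

```python
import random

# pi[s] is a probability distribution over actions in state s.
pi = {
    "s0": {"stay": 0.2, "move": 0.8},
    "s1": {"stay": 0.9, "move": 0.1},
}

def sample_action(policy, state, rng=random):
    # Draw one action according to pi(a | state).
    actions = list(policy[state])
    weights = list(policy[state].values())
    return rng.choices(actions, weights=weights, k=1)[0]
```

A deterministic policy is just the special case where, in every state, one action has probability 1.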
I highly recommend David Silver's RL course available on YouTube. The first two lectures focus particularly on MDPs and policies.