Markov Decision Process: value iteration, how does it work?

Question

I've been reading a lot about Markov Decision Processes (using value iteration) lately but I simply can't get my head around them. I've found a lot of resources on the Internet / books, but they all use mathematical formulas that are way too complex for my competencies.

Since this is my first year at college, I've found that the explanations and formulas provided on the web use notions / terms that are way too complicated for me and they assume that the reader knows certain things that I've simply never heard of.

I want to use it on a 2D grid (filled with walls (unattainable), coins (desirable) and enemies that move (which must be avoided at all costs)). The whole goal is to collect all the coins without touching the enemies, and I want to create an AI for the main player using a Markov Decision Process (MDP). Here is what it partially looks like (note that the game-related aspect is not so much of a concern here; I just really want to understand MDPs in general):

From what I understand, a crude simplification of MDPs is that they can create a grid which holds the direction we need to go (a kind of grid of "arrows" pointing where we need to go, starting at a certain position on the grid) to reach certain goals and avoid certain obstacles. Specific to my situation, that would mean it allows the player to know which direction to go to collect the coins and avoid the enemies.

Now, using the MDP terms, it would mean that it creates a collection of states (the grid) which holds certain policies (the action to take: up, down, left, right) for a certain state (a position on the grid). The policies are determined by the "utility" values of each state, which themselves are calculated by evaluating how beneficial getting there would be in the short and long term.

Is this correct? Or am I completely on the wrong track?

I'd at least like to know what the variables from the following equation represent in my situation:

(taken from the book "Artificial Intelligence - A Modern Approach" by Russell & Norvig)
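For reference, that equation is the value-iteration (Bellman) update for the utility of a state, which in the book's notation reads roughly:

U_{i+1}(s) = R(s) + gamma * max_a Σ_{s'} T(s, a, s') * U_i(s')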

I know that s would be a list of all the squares from the grid, a would be a specific action (up / down / right / left), but what about the rest?

How would the reward and utility functions be implemented?

It would be really great if someone knew a simple link which shows pseudo-code to implement a basic version with similarities to my situation in a very slow way, because I don't even know where to start here.

Thank you for your precious time.

(Note: feel free to add / remove tags or tell me in the comments if I should give more details about something.)

Answer

Yes, the mathematical notation can make it seem much more complicated than it is. Really, it is a very simple idea. I have implemented a value-iteration demo applet that you can play with to get a better idea.

Basically, let's say you have a 2D grid with a robot in it. The robot can try to move North, South, East, West (those are the actions a) but, because its left wheel is slippery, when it tries to move North there is only a .9 probability that it will end up at the square North of it, while there is a .1 probability that it will end up at the square West of it (and similarly for the other 3 actions). These probabilities are captured by the T() function. Namely, T(s,A,s') will look like:

s    A      s'     T    //x=0,y=0 is at the top-left of the screen, y grows downwards
x,y  North  x,y-1  .9   //we do move north
x,y  North  x-1,y  .1   //wheels slipped, so we veer left and move West
x,y  East   x+1,y  .9
x,y  East   x,y-1  .1   //slipped: we veer left and move North
x,y  South  x,y+1  .9
x,y  South  x+1,y  .1   //slipped: we veer left and move East
x,y  West   x-1,y  .9
x,y  West   x,y+1  .1   //slipped: we veer left and move South
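For concreteness, here is one way that transition model might be written in Python. This is just a sketch based on the table above (screen coordinates with y growing downwards, a "veer left" slip) and it ignores walls and grid edges; the names MOVES and T are purely illustrative:

# Intended move and the "veer left" slip for each action, in screen
# coordinates (x grows to the right, y grows downwards).
MOVES = {
    'North': {(0, -1): 0.9, (-1, 0): 0.1},   # slip: West
    'East':  {(1, 0): 0.9,  (0, -1): 0.1},   # slip: North
    'South': {(0, 1): 0.9,  (1, 0): 0.1},    # slip: East
    'West':  {(-1, 0): 0.9, (0, 1): 0.1},    # slip: South
}

def T(s, a, s_next):
    # Probability of landing in s_next when taking action a in state s.
    x, y = s
    prob = 0.0
    for (dx, dy), p in MOVES[a].items():
        if (x + dx, y + dy) == s_next:
            prob += p
    return prob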

You then set the Reward to be 0 for all states, but 100 for the goal state, that is, the location you want the robot to get to.
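In code, sticking with the Python sketch above, the reward function can be as small as this (the goal square (5,5) is the one used in the example below):

def R(s):
    # Reward: 100 on the goal square, 0 everywhere else.
    return 100.0 if s == (5, 5) else 0.0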

What value iteration does is it starts by giving a Utility of 100 to the goal state and 0 to all the other states. Then on the first iteration this 100 of utility gets distributed back 1 step from the goal, so all states that can reach the goal state in 1 step (all 4 squares right next to it) will get some utility. Namely, they will get a Utility proportional to the probability that, from that state, the best action takes us to the goal state. We then continue iterating: at each step the utility moves back 1 more step away from the goal.

In the example above, say you start with R(5,5) = 100 and R(.) = 0 for all other states. So the goal is to get to (5,5).

On the first iteration we set

U(5,6) = R(5,6) + gamma * (.9 * 100 + .1 * 0) = 90 * gamma

because the best action from 5,6 is North: it reaches 5,5 (utility 100) with probability .9 and slips West to 4,6 (still utility 0) with probability .1. Going East would only reach 5,5 through its .1 slip, so the max in the update picks North.

Similarly for (5,4), (4,5), (6,5).
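For instance, U(4,5) = gamma * (.9 * 100) as well: from (4,5) the best action is East, which lands on 5,5 with probability .9, while its .1 slip goes North to (4,4), which is still worth 0.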

All other states remain with U = 0 after the first iteration of value iteration.
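To address the question's request for pseudo-code, here is a minimal, deliberately slow value-iteration sketch in Python. It reuses the T and R helpers sketched above, assumes an arbitrary 10x10 grid with no walls, and uses purely illustrative names (GAMMA, value_iteration); probability that would fall off the edge of the grid is simply dropped, which is acceptable for a toy example:

GAMMA = 0.9                      # discount factor
WIDTH, HEIGHT = 10, 10           # size of the toy grid
ACTIONS = ['North', 'East', 'South', 'West']

def value_iteration(sweeps=50):
    states = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)]
    U = {s: 0.0 for s in states}             # initial utilities
    for _ in range(sweeps):
        new_U = {}
        for s in states:
            # Bellman update: immediate reward plus the discounted
            # expected utility of the best action.
            best = max(sum(T(s, a, s2) * U[s2] for s2 in states)
                       for a in ACTIONS)
            new_U[s] = R(s) + GAMMA * best
        U = new_U
    # The policy picks, in each state, the action whose expected
    # successor utility is highest.
    policy = {s: max(ACTIONS,
                     key=lambda a: sum(T(s, a, s2) * U[s2] for s2 in states))
              for s in states}
    return U, policy

U, policy = value_iteration()
print(policy[(5, 6)])            # prints 'North': head towards the goal at (5,5)

Running a fixed number of sweeps keeps the sketch simple; a real implementation would instead stop once the largest change in utility between sweeps drops below a small threshold.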
