How does Monte Carlo Search Tree work?


Problem Description

Trying to learn MCST using YouTube videos and papers like this one.

http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Applications_files/grand-challenge.pdf

However, I am not having much luck understanding the details beyond the high-level theoretical explanations. Here are some quotes from the paper above and questions I have.

  1. Selection Phase: MCTS iteratively selects the highest scoring child node of the current state. If the current state is the root node, where did these children come from in the first place? Wouldn't you have a tree with just a single root node to begin with? With just a single root node, do you get into the Expansion and Simulation phases right away?

If MCTS selects the highest-scoring child node in the Selection phase, do you never explore other children, or possibly even a brand-new child, while going down the levels of the tree?

How does the Expansion phase happen for a node? In the diagram above, why did it not choose the leaf node but instead decide to add a sibling to that leaf node?

During the Simulation phase, a stochastic policy is used to select legal moves for both players until the game terminates. Is this stochastic policy a hard-coded behavior? Are you basically rolling dice in the simulation to choose one of the possible moves, taking turns between the players until the end?

The way I understand this is that you start at a single root node and, by repeating the above phases, construct the tree to a certain depth. Then you choose the child with the best score at the second level as your next move. The size of the tree you are willing to construct is basically your hard AI-responsiveness requirement, right? Since while the tree is being constructed, the game will stall and compute this tree.

Answer

  1. Selection Phase: MCTS iteratively selects the highest scoring child node of the current state. If the current state is the root node, where did these children come from in the first place? Wouldn't you have a tree with just a single root node to begin with? With just a single root node, do you get into the Expansion and Simulation phases right away?

The selection step is typically implemented not to actually choose among nodes which really exist in the tree (having been created through the Expansion step). It is typically implemented to choose among all possible successor states of the game state matching your current node.

So, at the very beginning, when you have just a root node, you'll want your Selection step to still be able to select one out of all the possible successor game states (even if they don't have matching nodes in the tree yet). Typically you'll want a very high score (infinite, or some very large constant) for game states which have never been visited yet (which don't have nodes in the tree yet). This way, your Selection Step will always randomly select among any states that don't have a matching node yet, and only really use the exploration vs. exploitation trade-off in cases where all possible game states already have a matching node in the tree.
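To make that concrete, here is a minimal sketch of such a selection step in Python. The `node_stats` layout (a dict from move to `(wins, visits)`), the `score` callback, and the uniform random choice among unvisited moves are my own illustrative assumptions, not taken from the paper:

```python
import random

def select_move(node_stats, legal_moves, score):
    """Select among ALL legal moves of the current state, not only the
    ones that already have a node. Moves never tried from this node get
    an effectively infinite score, so one of them is picked (at random)
    before any visited move is revisited. `node_stats` maps a move to
    (wins, visits); `score` ranks the already-visited moves."""
    unvisited = [m for m in legal_moves if m not in node_stats]
    if unvisited:
        # "Infinite" score for unseen states: always explore these first.
        return random.choice(unvisited)
    # Only once every move has statistics does the real score decide.
    return max(legal_moves, key=lambda m: score(*node_stats[m]))
```

Once every successor has a matching node, `score` would be the exploration-vs-exploitation trade-off discussed below.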

  1. If MCTS selects the highest-scoring child node in the Selection phase, do you never explore other children, or possibly even a brand-new child, while going down the levels of the tree?

The "score" used by the Selection step should typically not just be the average of all outcomes of simulations going through that node. It should typically be a score consisting of two parts: an "exploration" part, which is high for nodes that have been visited relatively infrequently, and an "exploitation" part, which is high for nodes which appear to be good moves so far (where many simulations going through that node ended in a win for the player who is allowed to choose a move to make). This is described in Section 3.4 of the paper you linked. The W(s, a) / N(s, a) term is the exploitation part (simply the average score), and B(s, a) is the exploration part.
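As a sketch of that two-part score, here is the common UCB1-style form; the paper's B(s, a) bonus may differ in its exact formula, and the constant `c` is an assumed tuning parameter:

```python
import math

def uct_score(wins, visits, parent_visits, c=1.4):
    """Two-part score: exploitation (average outcome, W/N) plus an
    exploration bonus that is large for rarely visited nodes and
    shrinks as `visits` grows. UCB1 form; the paper's exact B(s, a)
    bonus may differ, and c = 1.4 is an assumed constant."""
    exploitation = wins / visits                      # W(s, a) / N(s, a)
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration
```

Note how a node visited once out of a hundred parent visits outranks a heavily visited node with a better average, which is exactly the exploration pressure described above.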

  1. How does the Expansion phase happen for a node? In the diagram above, why did it not choose the leaf node but instead decide to add a sibling to that leaf node?

The Expansion step is typically implemented to simply add a node corresponding to the final game state selected by the Selection step (following my answer to your first question, the Selection step will always end by selecting one game state that has never been selected before).
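Under the same illustrative assumptions as above (a dict-based tree; the key scheme and field names are mine, not the paper's), the Expansion step can be as small as:

```python
def expand(tree, parent, move, child_state):
    """Add a node for the newly selected game state, with empty
    statistics, and link it under its parent. The dict layout and
    field names are illustrative, not from the paper."""
    child = (parent, move)                     # key for the new node
    tree[child] = {"state": child_state, "wins": 0, "visits": 0,
                   "children": []}
    tree[parent]["children"].append(child)
    return child
```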

  1. During the Simulation phase, a stochastic policy is used to select legal moves for both players until the game terminates. Is this stochastic policy a hard-coded behavior? Are you basically rolling dice in the simulation to choose one of the possible moves, taking turns between the players until the end?

The most straightforward (and probably most common) implementation is indeed to play completely at random. It is also possible to do this differently though. You could for example use heuristics to create a bias towards certain actions. Typically, completely random play is faster, allowing you to run more simulations in the same amount of processing time. However, it typically also means every individual simulation is less informative, meaning you actually need to run more simulations for MCTS to play well.
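A fully random playout is then just a loop over uniformly random legal moves until a terminal state; the game-interface callbacks (`legal_moves`, `apply_move`, `is_terminal`, `result`) are assumed names, not from the paper:

```python
import random

def simulate(state, legal_moves, apply_move, is_terminal, result):
    """Random playout: repeatedly pick a uniformly random legal move
    (implicitly alternating players, if the state encodes whose turn
    it is) until the game ends, then return the outcome. The callback
    names are illustrative assumptions."""
    while not is_terminal(state):
        state = apply_move(state, random.choice(legal_moves(state)))
    return result(state)
```

Biasing the `random.choice` with move heuristics is where the informativeness-vs-speed trade-off mentioned above comes in.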

  1. The way I understand this is that you start at a single root node and, by repeating the above phases, construct the tree to a certain depth. Then you choose the child with the best score at the second level as your next move. The size of the tree you are willing to construct is basically your hard AI-responsiveness requirement, right? Since while the tree is being constructed, the game will stall and compute this tree.

MCTS does not uniformly explore all parts of the tree to the same depth. It has a tendency to explore parts which appear to be interesting (strong moves) deeper than parts which appear to be uninteresting (weak moves). So, typically you wouldn't really use a depth limit. Instead, you would use a time limit (for example, keep running iterations until you've spent 1 second, or 5 seconds, or 1 minute, or whatever amount of processing time you allow), or an iteration count limit (for example, allow it to run 10K or 50K or any number of simulations you like).
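The outer loop can then be sketched as a deadline-based driver; `one_iteration` stands in for one full Selection/Expansion/Simulation/Backpropagation pass, and both names are illustrative assumptions:

```python
import time

def run_mcts(root, one_iteration, time_budget=1.0):
    """Keep running full MCTS iterations until the time budget
    (in seconds) is spent, then report how many iterations ran.
    Swap the deadline check for an iteration counter to get an
    iteration-count limit instead."""
    deadline = time.monotonic() + time_budget
    iterations = 0
    while time.monotonic() < deadline:
        one_iteration(root)
        iterations += 1
    return iterations
```

After the budget is exhausted, you would pick the root child with the best statistics (commonly the most-visited one) as the actual move to play.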

