When to use a certain Reinforcement Learning algorithm?


Question


I'm studying Reinforcement Learning and reading Sutton's book for a university course. Besides the classic DP, MC, TD and Q-learning algorithms, I'm reading about policy gradient methods and genetic algorithms for solving decision problems. I have no previous experience with this topic, and I'm having trouble understanding when one technique should be preferred over another. I have a few ideas, but I'm not sure about them. Can someone briefly explain, or point me to a source that describes, the typical situations in which a certain method should be used? As far as I understand:

  • Dynamic Programming and Linear Programming should be used only when the MDP has few actions and states and the model is known, since they are very expensive. But when is DP better than LP?
  • Monte Carlo methods are used when I don't have a model of the problem but I can generate samples. They are unbiased but have high variance.
  • Temporal Difference methods should be used when MC methods need too many samples to achieve low variance. But when should I use TD and when Q-learning? (See the sketch after this list.)
  • Policy Gradient and Genetic Algorithms are good for continuous MDPs. But when is one better than the other?
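
To make the bias/variance point in the MC and TD bullets concrete, here is a minimal sketch of the two update targets. The function names, `alpha`, `gamma`, and the `(state, reward)` episode layout are my own illustrative assumptions, not something from the question:

```python
# Minimal sketch (illustrative): V is assumed to be a dict of value estimates,
# e.g. collections.defaultdict(float); alpha is a step size, gamma a discount.

def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Every-visit Monte Carlo: the target is the full sampled return G,
    which is unbiased but has high variance (it sums many random rewards)."""
    G = 0.0
    for state, reward in reversed(episode):   # episode = [(s0, r1), (s1, r2), ...]
        G = reward + gamma * G                # return observed from `state` onward
        V[state] += alpha * (G - V[state])
    return V

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """TD(0): the target bootstraps from the current estimate V[s_next],
    which lowers variance but adds bias while V is still inaccurate."""
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V
```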


More precisely, I think that to choose a learning method a programmer should ask himself the following questions:

  • Does the agent learn online or offline?
  • Can we separate the exploration and exploitation phases?
  • Can we perform enough exploration?
  • Is the horizon of the MDP finite or infinite?
  • Are states and actions continuous?


But I don't know how these details of the problem affect the choice of a learning method. I hope some programmers already have experience with RL methods and can help me better understand their applications.

Answer

Briefly:


does the agent learn online or offline? This helps you decide whether to use an online or an offline algorithm (e.g. online: SARSA, offline: Q-learning). Online methods have more limitations and require more care.
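
The SARSA/Q-learning contrast in this point comes down to the update target: SARSA (on-policy) bootstraps from the action the behaviour policy actually takes next, while Q-learning (off-policy) bootstraps from the greedy action. A minimal tabular sketch, where the dict-of-dicts `Q`, `alpha` and `gamma` are my own illustrative assumptions:

```python
# Q is assumed to be a nested dict: Q[state][action] -> estimated value.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """SARSA (on-policy): the target uses a_next, the action the behaviour
    policy actually selected in s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q-learning (off-policy): the target uses the greedy action in s_next,
    regardless of which action will actually be taken."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```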


can we separate exploring and exploiting phases? These two phases are normally kept in balance. For example, in epsilon-greedy action selection you explore with probability epsilon (choosing a random action) and exploit with probability 1-epsilon (choosing the greedy action). You can separate the two and ask the algorithm to explore first (e.g. choose random actions) and then exploit. But this is only possible when you are learning offline, probably using a model of the system dynamics, and it normally means collecting a lot of sample data in advance.
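
A minimal sketch of epsilon-greedy selection and of "separating" the two phases by scheduling epsilon; `Q_s` (the action-value dict of the current state) and the parameter names are my own illustrative assumptions:

```python
import random

def epsilon_greedy(Q_s, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit
    (the greedy action with respect to the current estimates Q_s)."""
    if random.random() < epsilon:
        return random.choice(list(Q_s))   # explore
    return max(Q_s, key=Q_s.get)          # exploit

# "Separated" phases: pure exploration first, pure exploitation afterwards,
# e.g. when learning offline from a model or from pre-collected samples.
def scheduled_epsilon(step, exploration_steps=10_000):
    return 1.0 if step < exploration_steps else 0.0
```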


can we perform enough exploration? The level of exploration can be decided based on the definition of the problem. For example, if you have a simulation model of the problem in memory, you can explore as much as you want. But real exploration is limited by the amount of resources you have (e.g. energy, time, ...).


are states and actions continuous? Considering this helps you choose the right approach (algorithm). Both discrete and continuous algorithms have been developed for RL. Some of the "continuous" algorithms internally discretize the state or action spaces.
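
As an illustration of that last sentence, here is a minimal uniform-binning sketch that maps a continuous state vector to a discrete index tuple so that tabular algorithms can be applied; the function and its parameters are my own assumptions, not something from the answer:

```python
import numpy as np

def discretize(x, low, high, n_bins):
    """Map a continuous observation x (array-like) to a tuple of bin indices.
    low/high are per-dimension bounds, n_bins the per-dimension resolution."""
    x, low, high = np.asarray(x, float), np.asarray(low, float), np.asarray(high, float)
    ratios = (x - low) / (high - low)                        # scale each dimension to [0, 1]
    idx = np.floor(ratios * np.asarray(n_bins)).astype(int)  # bin index per dimension
    return tuple(np.clip(idx, 0, np.asarray(n_bins) - 1))    # clamp out-of-range values
```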
