When using function approximation in reinforcement learning, how does one select actions?

Problem description

This slide shows an equation for Q(state, action) in terms of a set of weights and feature functions. I'm confused about how to write the feature functions.

Given an observation, I can understand how to extract features from the observation. But given an observation, one doesn't know what effect taking an action will have on the features. So how does one write a function that maps an observation and an action to a numerical value?

In the Pacman example shown a few slides later, one knows, given a state, what the effect of an action will be. But that's not always the case. For example, consider the cart-pole problem (in OpenAI gym). The features (which are, in fact, what the observation consists of) are four values: cart position, cart velocity, pole angle, and pole rotational velocity. There are two actions: push left, and push right. But one doesn't know in advance how those actions will change the four feature values. So how does one compute Q(s, a)? That is, how does one write the feature functions fi(state, action)?

Thanks.

Solution

How you select actions depends on your algorithm and your exploration strategy. For example, in Q-learning you can use something called epsilon-greedy exploration: with probability epsilon you select an action at random, and the rest of the time you take the action with the highest estimated value (the greedy action).
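As a minimal illustration of that selection step (the names below are illustrative, not from the original answer), epsilon-greedy selection over a list of estimated action values might look like this in Python:

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    q_values: a list of estimated Q(s, a) values, one entry per action.
    epsilon:  exploration rate in [0, 1].
    """
    if random.random() < epsilon:
        # Explore: any action, uniformly at random.
        return random.randrange(len(q_values))
    # Exploit: the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```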

"So how does one write a function that maps an observation and an action to a numerical value?"

By using the rewards you can approximate state-action values: you use the observed reward and (depending on the algorithm) the estimated value of the next state. For example, the Q-learning update formula:
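Q(s, a) ← Q(s, a) + α * [r + γ * max_a' Q(s', a') − Q(s, a)]

where α is the learning rate, γ is the discount factor, r is the reward received, and s' is the next state.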

You update the old Q(s,a) value with the reward and your estimate of the optimal future value from the next state.

In tabular Q-learning you estimate each Q(s,a) value individually and update that value every time you visit the state and take the action. In function-approximation Q-learning you use something like a neural net to approximate Q(s,a). When choosing an action, you feed the state (or the state-action pair) into the neural net and get back approximate values for each action, then pick an action according to your algorithm (such as the epsilon-greedy method). As your agent interacts with the environment, you train and update the neural net with the new data to improve the function approximation.
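To connect this back to the cart-pole question about writing f_i(state, action): one common trick is to let the action select which block of weights the observation features feed into, rather than trying to predict how the action will change the features. A rough linear-approximation sketch, with illustrative (not canonical) names:

```python
import numpy as np

N_FEATURES = 4  # cart position, cart velocity, pole angle, pole angular velocity
N_ACTIONS = 2   # push left, push right

def features(observation, action):
    """f(s, a): copy the observation into the block that belongs to `action`.

    The function never predicts what the action will do to the state;
    the action only decides which block of weights gets activated.
    """
    f = np.zeros(N_FEATURES * N_ACTIONS)
    f[action * N_FEATURES:(action + 1) * N_FEATURES] = observation
    return f

def q_value(weights, observation, action):
    """Q(s, a) = w . f(s, a), a linear function approximator."""
    return float(weights @ features(observation, action))

def select_action(weights, observation, epsilon=0.1):
    """Epsilon-greedy selection over the approximated Q-values."""
    if np.random.rand() < epsilon:
        return np.random.randint(N_ACTIONS)
    values = [q_value(weights, observation, a) for a in range(N_ACTIONS)]
    return int(np.argmax(values))
```

The weights would then be adjusted toward r + γ * max_a' Q(s', a') using the update rule above; a neural network plays the same role, but with learned features in place of hand-written ones.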
