How to demo Vowpal Wabbit's contextual bandits in real online mode?

Problem description

Following the available docs and resources, it is not really clear how to accomplish a simple getting-started flow where you'd launch Vowpal Wabbit as a daemon (possibly even without any pre-learnt model) and have it learn and explore online. I'm looking for a flow where I'd feed in a context, get back a recommendation, and feed back a cost/reward.

So let me skip the technical descriptions of what's been tried and simply ask for a clear demonstration regarding what I might consider essential in this vein ―

  • How can ongoing learning be demonstrated through the daemon, not in offline mode from batch data but purely from online interactions? Any good pointers?
  • How should the cost/reward for a chosen action be reported back in daemon mode? Once per action? In bulk? Either way, how?
  • Somewhat related: would you recommend a daemon-based setup for a live-bandits production system, or rather one of the language APIs?
  • Could you point to where the server code sits in the huge codebase? It could be a good place to start exploring it systematically.

I typically get a distribution (of size equal to the number of allowed actions) as a reply for every input sent, and typically the same distribution regardless of what I sent in. Maybe it takes a whole learning epoch with the default --cb_explore algorithm; I wouldn't know, and I'm not sure the epoch duration can be set from outside.

I understand that a lot has gone into enabling learning from past interactions and from cbified (contextual-bandit-formatted) data. However, I think there should also be some available explanation covering the more-or-less pragmatic essentials above.

Thanks so much!

Answer

Here goes. This flow only requires a subset of the Vowpal Wabbit input format. First, after a successful installation, we start a Vowpal Wabbit daemon:

vw --cb_explore 2 --daemon --port 26542 --save_resume

In the above, we tell VW to start a contextual-bandit model-serving daemon, without any upfront training having been provided through old policy data. The model will be VW's default contextual-bandit model, and, as specified above, it will assume just two actions to choose from. Vowpal will initially suggest actions at random, and will over time approach the optimal policy.

Let's just check the daemon is up: pgrep 'vw.*' should return a list of processes.

At any later time, if we want to stop the daemon and start it again, we can simply run pkill -9 -f 'vw.*--port 26542'.

Now let us simulate decision points and the costs obtained for the actions taken. In the following I dispatch messages to the daemon from the terminal, but you can do the same with a tool like Postman or from your own code (a sketch of the latter appears a little further below):

echo " | a b " | netcat localhost 26542

Here we just told Vowpal to suggest what action we should take for a context comprising the feature set (a, b).
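
As mentioned above, the same request can also be issued from your own code instead of the terminal. Below is a minimal sketch of what that might look like, assuming Python and a raw TCP connection to the daemon started above on localhost:26542 (the helper name vw_request is just illustrative, not part of any official client):

import socket

def vw_request(line, host="localhost", port=26542):
    """Send one VW-format line to the daemon and return its one-line reply."""
    with socket.create_connection((host, port)) as sock:
        # The daemon expects newline-terminated examples.
        sock.sendall((line.strip() + "\n").encode("utf-8"))
        # Read back the single reply line (the predicted distribution).
        reply = sock.makefile("r").readline()
    return reply.strip()

# Equivalent to: echo " | a b " | netcat localhost 26542
print(vw_request(" | a b "))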

Vowpal succinctly replies not with a chosen action, but with a probability distribution over the two actions our model was instructed to choose from:

0.975000 0.025000

These are of course only the result of some random initialization, as the model hasn't seen any costs yet! Our application using Vowpal is now expected to choose an action at random according to this distribution; this part is not implemented by Vowpal but left to the application code. The contextual-bandit model relies on us sampling from this distribution to choose the action to be played against the environment; if we don't follow this expectation, the algorithm may not accomplish its learning.
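
A minimal sketch of that sampling step might look like the following (sample_action is a hypothetical helper; it assumes the reply is a whitespace-separated list of probabilities and that actions are numbered from 1, matching the cost labels used below):

import random

def sample_action(reply):
    """Sample a 1-based action index from the probability distribution
    returned by --cb_explore."""
    probs = [float(p) for p in reply.split()]
    # random.choices draws proportionally to the given weights.
    return random.choices(range(1, len(probs) + 1), weights=probs, k=1)[0]

action = sample_action("0.975000 0.025000")  # -> 1 about 97.5% of the time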

So imagine we sampled from this distribution and got action 1, then executed that action in the real-world environment (for the same context a b that we asked Vowpal to recommend for). Imagine we got back a cost of 0.7 this time. We have to communicate this cost back to Vowpal as feedback, using the action:cost:probability label format, where the last field is the probability with which the chosen action was played:

echo " 1:0.7:1 | a b " | netcat localhost 26542

Vowpal got our feedback, and gives us back its updated prediction for this context:

0.975000 0.025000

We don't care about it right now unless we wish to get a recommendation for the exact same context again, but we get its updated recommendation anyway.

Obviously it's the same recommendation as before, as our single piece of feedback so far isn't enough for the model to learn anything. Repeat this process many times, for many different context features, and the predictions returned from Vowpal will adapt and change: the model will begin shifting its predictions according to what it has learned.
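
Putting the pieces together, an end-to-end online loop might look roughly like the sketch below. It reuses the vw_request and sample_action helpers sketched above; observe_cost is purely a hypothetical stand-in for whatever your real environment does when an action is played:

def observe_cost(context, action):
    # Placeholder: a real system would execute the action against the
    # environment and measure the resulting cost.
    return 0.7

contexts = [" | a b ", " | a c ", " | b d "]

for context in contexts:
    reply = vw_request(context)            # e.g. "0.975000 0.025000"
    probs = [float(p) for p in reply.split()]
    action = sample_action(reply)          # 1-based action index
    cost = observe_cost(context, action)
    # Feed back action:cost:probability for the action actually played,
    # together with the same context features.
    label = f"{action}:{cost}:{probs[action - 1]}"
    vw_request(label + context)            # e.g. "1:0.7:0.975 | a b "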

Note that I talk about costs and not rewards here: unlike much of the literature on the algorithms implemented in Vowpal, the command-line version at least takes costs as feedback, not rewards.
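
If your system naturally measures rewards, one simple (purely illustrative) convention is to negate the reward before feeding it back as a cost:

reward = 0.3
cost = -reward   # lower cost is better, so a higher reward maps to a lower cost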
