Confusion with Vowpal Wabbit Contextual Bandit training data formatting

Views: 42 Published: 2021/7/7 18:57:00 Tags: reinforcement-learning vowpalwabbit


Question

I am new to Vowpal Wabbit and am working on a multi-armed bandit model to recommend different CTAs for sign-up pop-ups. I have already completed the walkthrough on the main site, but I am a bit confused about what the training data is supposed to look like for the --cb_explore_adf version. So far, for the regular version (with a fixed number of actions), the data looks like:

action:cost:probability | features

which makes sense, but then when you get to the adf version, it becomes:

| a:1 b:0.5
0:0.1:0.75 | a:0.5 b:1 c:2
 
shared | s_1 s_2
0:1.0:0.5 | a:1 b:1 c:1
| a:0.5 b:2 c:1

I've read the documentation numerous times and I still don't understand how this works.

I think an example showing how data similar to mine would be adapted to the above format would be great.

Example of my use case: 2 actions (1 and 2) and 3 features (language, country, favorite sport).

Some documentation I've looked at:

https://vowpalwabbit.org/tutorials/cb_simulation.html

Playing around with it, I created a train.txt with this input:

shared |user language=en nation=CAN
|action arm=10-OC-ValueProp10 
0:0:0.5 |action arm=11-OC-ValueProp11 

shared |user language=it nation=ITA
|action arm=10-OC-ValueProp10 
0:0:0.5 |action arm=11-OC-ValueProp11 

shared |user language=it nation=ITA
0:0:0.5 |action arm=10-OC-ValueProp10 
|action arm=11-OC-ValueProp11 

shared |user language=it nation=ITA
0:0:0.5 |action arm=10-OC-ValueProp10 
|action arm=11-OC-ValueProp11

But when I run this:

from vowpalwabbit import pyvw
vw = pyvw.vw("-d full_data.txt --cb_explore_adf -q ua --quiet --epsilon 0.2")
vw.predict("|user language=en nation=USA")

I get [1.0], which doesn't make sense. I am sure that I am doing something wrong.

Recommended answer

ADF stands for action dependent features. Each event/example therefore consists of multiple lines, with the first line being an optional set of shared features (marked with shared).

Apart from the shared line, each line corresponds to an action.

So, when you provide VW with the input:

|user language=en nation=USA

You are asking for a prediction over only one action: there is no shared line, so the single line is treated as an action. That is why you get back a PMF (probability mass function, i.e. the probability of choosing each distinct item) that is simply [1.0], which states that the single action should be chosen with a probability of 1.0. Reading the features, however, it looks as though you are actually passing what should be the shared features.
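To make the point concrete, here is a minimal sketch that counts how many actions an example describes. The helper `count_actions` is hypothetical (it is not part of VW) and only mirrors the convention described above: a line starting with shared carries shared features, and every other non-empty line is one action.

```python
def count_actions(example: str) -> int:
    """Count action lines in one multi-line ADF example (hypothetical helper)."""
    lines = [ln.strip() for ln in example.splitlines() if ln.strip()]
    # A line whose first token is "shared" holds shared features, not an action.
    return sum(1 for ln in lines if ln.split()[0] != "shared")


single = "|user language=en nation=USA"
full = """shared |user language=en nation=USA
|action arm=10-OC-ValueProp10
|action arm=11-OC-ValueProp11"""

print(count_actions(single))  # 1
print(count_actions(full))    # 2
```

With only one action line, the only possible PMF is [1.0].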

For each prediction you need to provide all of the features for each action, because the action itself is essentially defined by its set of features (hence ADF).

Your predict data should look something like this (notice that the label is omitted):

shared |user language=it nation=ITA
|action arm=10-OC-ValueProp10 
|action arm=11-OC-ValueProp11
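If you build these examples in Python, a small formatter keeps the layout consistent. The following is only a sketch: `format_adf_example` is a hypothetical helper, and the namespace names user and action are simply taken from the data in the question.

```python
def format_adf_example(shared: dict, arms: list) -> str:
    """Build an unlabeled multi-line ADF example (hypothetical helper)."""
    shared_feats = " ".join(f"{key}={value}" for key, value in shared.items())
    lines = [f"shared |user {shared_feats}"]
    # One line per candidate action; no label because this is for prediction.
    lines += [f"|action arm={arm}" for arm in arms]
    return "\n".join(lines)


example = format_adf_example(
    {"language": "it", "nation": "ITA"},
    ["10-OC-ValueProp10", "11-OC-ValueProp11"],
)
print(example)
```

The tutorial linked in the question passes a newline-joined string of this shape to vw.predict; it is worth confirming that this works with the version of the Python bindings you have installed.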

VW will then emit something that looks like [0.9, 0.1]. You should then sample from this PMF (to allow for exploration) to determine the chosen action.
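Sampling from the PMF can be sketched as follows. The function name `sample_pmf` and the seeded generator are assumptions made for illustration; the approach is plain inverse-CDF sampling over the list of probabilities.

```python
import random


def sample_pmf(pmf, rng=random):
    """Sample an action index from a PMF; returns (index, normalized probability)."""
    total = sum(pmf)
    draw = rng.random() * total  # scale in case the PMF does not sum to exactly 1
    cumulative = 0.0
    for index, prob in enumerate(pmf):
        cumulative += prob
        if draw < cumulative:
            return index, prob / total
    # Floating-point fallback: return the last action.
    return len(pmf) - 1, pmf[-1] / total


rng = random.Random(0)  # seeded only to make the sketch reproducible
index, prob = sample_pmf([0.9, 0.1], rng)
print(index, prob)
```

Keep the returned probability alongside the chosen index: it is the value you would later record as the probability in the training label for that action.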

The format of the training data is a bit confusing because the label format was reused from the non-ADF version. The action portion of the label is actually unused, since the label must be on the same line as the action it is for.

shared |user language=en nation=CAN
|action arm=10-OC-ValueProp10 
0:0:0.5 |action arm=11-OC-ValueProp11

The above example says that the second action had a cost of 0 and that, when it was picked, the probability of choosing it was 0.5 (the value in the PMF).
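A labeled training example can be produced the same way. This is a sketch under the assumptions above: `format_adf_training_example` is a hypothetical helper, the namespaces user and action come from the question's data, and the leading 0 in the label is just a placeholder because the label's position identifies the action.

```python
def format_adf_training_example(shared, arms, chosen_index, cost, probability):
    """Build a labeled multi-line ADF example (hypothetical helper)."""
    shared_feats = " ".join(f"{key}={value}" for key, value in shared.items())
    lines = [f"shared |user {shared_feats}"]
    for i, arm in enumerate(arms):
        if i == chosen_index:
            # Label goes on the chosen action's own line: placeholder:cost:probability
            lines.append(f"0:{cost}:{probability} |action arm={arm}")
        else:
            lines.append(f"|action arm={arm}")
    return "\n".join(lines)


labeled = format_adf_training_example(
    {"language": "en", "nation": "CAN"},
    ["10-OC-ValueProp10", "11-OC-ValueProp11"],
    chosen_index=1,
    cost=0,
    probability=0.5,
)
print(labeled)
```

Separate consecutive examples in the training file with a blank line, as in the train.txt shown in the question.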
