带有Rcpp的多臂土匪 [英] Multi-armed bandits with Rcpp

查看:146
本文介绍了带有Rcpp的多臂土匪的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在从

I am translating the epsilon-greedy algorithm for multiarmed bandits from here. This is a rather nice demonstration of the power and elegance of Rcpp. However, the results from this version do not tally with the one that is mentioned in the link above. I am aware that this is probably a very niche question but have no other venue to post this on!

代码摘要如下.基本上,我们有一组武器,每个武器都以预先确定的概率支付奖励,而我们的工作是通过随机抽取武器,而间歇性地获得最佳奖励,最终使我们收敛保持最佳状态. 约翰·迈尔斯·怀特(John Myles White).

A summary of the code is as follows. Basically, we have a set of arms, each of which pays out a reward with a pre-defined probability and our job is to show that by drawing at random from the arms while drawing the arm with the best reward intermittently eventually allows us to converge on to the best arm. A nice explanation of this algorithm is provided by John Myles White.

现在,输入代码:

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]

struct EpsilonGreedy {
  double epsilon;
  std::vector<int> counts;
  std::vector<double> values;
};

int index_max(std::vector<double>& v) {
  return std::distance(v.begin(), std::max_element(v.begin(), v.end()));
}

int index_rand(std::vector<double>& v) {
  return R::runif(0, v.size()-1);
}

int select_arm(EpsilonGreedy& algo) {
  if (R::runif(0, 1) > algo.epsilon) {
    return index_max(algo.values);
  } else {
    return index_rand(algo.values);
  }
}

void update(EpsilonGreedy& algo, int chosen_arm, double reward) {
  algo.counts[chosen_arm] += 1;

  int n = algo.counts[chosen_arm];
  double value = algo.values[chosen_arm];

  algo.values[chosen_arm] = ((n-1)/n) * value + (1/n) * reward;
}

struct BernoulliArm {
  double p;
};

int draw(BernoulliArm arm) {
  if (R::runif(0, 1) > arm.p) {
    return 0;
  } else {
    return 1;
  }
}

// [[Rcpp::export]]
DataFrame test_algorithm(double epsilon, std::vector<double>& means, int n_sims, int horizon) {

  std::vector<BernoulliArm> arms;

  for (auto& mu : means) {
    BernoulliArm b = {mu};
    arms.push_back(b);
  }

  std::vector<int> sim_num, time, chosen_arms;
  std::vector<double> rewards;

  for (int sim = 1; sim <= n_sims; ++sim) {

    std::vector<int> counts(means.size(), 0);
    std::vector<double> values(means.size(), 0.0); 

    EpsilonGreedy algo = {epsilon, counts, values};

    for (int t = 1; t <= horizon; ++t) {
      int chosen_arm = select_arm(algo);
      double reward = draw(arms[chosen_arm]);
      update(algo, chosen_arm, reward);

      sim_num.push_back(sim);
      time.push_back(t);
      chosen_arms.push_back(chosen_arm);
      rewards.push_back(reward);
    }
  }

  DataFrame results = DataFrame::create(Named("sim_num") = sim_num,
                                        Named("time") = time,
                                        Named("chosen_arm") = chosen_arms,
                                        Named("reward") = rewards);

  return results;
}

/***R

means <- c(0.1, 0.1, 0.1, 0.1, 0.9)
results <- test_algorithm(0.1, means, 5000, 250) 

p2 <- ggplot(results) + geom_bar(aes(x = chosen_arm)) + theme_bw()
*/

在仿真过程中选择的手臂图(即上面的图p2)如下:

The plot of the arm chosen during simulations (i.e., plot p2 above) is as follows:

很明显,尽管奖励很低,但选择的第一支手臂却不成比例!发生了什么事?

Obviously, the first arm is being chosen disproportionately despite having a low reward! What is going on?

推荐答案

我不知道土匪应该如何工作,但是进行一些标准调试(即查看生成的值)后发现,您生成了许多零.

I don't know how the bandits are supposed to work, but a little standard debugging (ie: look at the values generated) revealed that you generated lots of zeros.

修复了一些基本错误(使您的C/C ++循环for (i=0; i<N; ++)即从零开始并与小于进行比较)之后,我们剩下了其他较小的错误,例如runif(1,N)强制转换为int在N个值上具有相等的范围(提示:加0.5并四舍五入并强制转换,或从1..N个整数集中采样一个整数).

After fixing some elementary errors (make your C/C++ loops for (i=0; i<N; ++) ie start at zero and compare with less-than) we are left with other less subtle errors such as runif(1,N) cast to int not giving you a equal range over N values (hint: add 0.5 and round and cast, or sample one integer from the set of 1..N integers).

但是罪魁祸首似乎是您的第一个论点epsilon.只需将其设置为0.9,就可以得到如下图的图表,您仍然会遇到缺少最后一个半"单位的问题.

But the main culprit seems to be your first argument epsilon. Simply setting that to 0.9 gets me a chart like the following where you still the issue with the last 'half' unit missing.

这篇关于带有Rcpp的多臂土匪的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆