当样本大小和概率变化时进行有效的多项采样 [英] Efficient multinomial sampling when sample size and probability vary

查看:154
本文介绍了当样本大小和概率变化时进行有效的多项采样的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这个问题涉及从具有不同样本大小和概率的多项式分布中进行有效采样.下面我将描述我使用的方法,但是想知道是否可以通过一些智能矢量化来改进它.

This question pertains to efficient sampling from multinomial distributions with varying sample sizes and probabilities. Below I describe the approach I have used, but wonder whether it can be improved with some intelligent vectorisation.

我正在模拟生物在多个种群中的扩散.人群j中的个体以概率p[i, j]分散到人群i中.给定种群1的初始丰度为10,并且分别向种群1、2和3分散c(0.1, 0.3, 0.6)的概率,我们可以使用rmultinom模拟分散过程:

I'm simulating dispersal of organisms amongst multiple populations. Individuals from population j disperse to population i with probability p[i, j]. Given an initial abundance of 10 in population 1, and probabilities of dispersal c(0.1, 0.3, 0.6) to populations 1, 2, and 3, respectively, we can simulate the dispersal process with rmultinom:

set.seed(1)
rmultinom(1, 10, c(0.1, 0.3, 0.6))

#      [,1]
# [1,]    0
# [2,]    3
# [3,]    7

我们可以将其扩展为考虑n个来源人群:

We can extend this to consider n source populations:

set.seed(1)
n <- 3
p <- replicate(n, diff(c(0, sort(runif(n-1)), 1)))
X <- sample(100, n)

上面的

p是从一个总体(列)移动到另一总体(行)的概率矩阵,而X是初始总体大小的向量.现在可以使用以下公式模拟散布在每对人口之间的个人数量(以及剩余的人口数量):

Above, p is a matrix of probabilities of moving from one population (column) to another (row), and X is a vector of initial population sizes. The number of individuals dispersing between each pair of populations (and those remaining where they are) can now be simulated with:

sapply(seq_len(ncol(p)), function(i) {
  rmultinom(1, X[i], p[, i])  
})

#      [,1] [,2] [,3]
# [1,]   19   42   11
# [2,]    8   18   43
# [3,]   68    6    8

其中,在第i行和第j列的元素的值是从总体j迁移到总体i的个体数量.该矩阵的rowSums给出了新的人口规模.

where the value of the element at the ith row and jth column is the number of individuals moving from population j to population i. The rowSums of this matrix give the new population sizes.

我想用恒定的概率矩阵但是用变化的(预定义的)初始丰度重复多次.下面的小示例可以实现此目的,但是在遇到较大问题时效率不高.所得矩阵为5个模拟中的每个模拟提供了三个种群中每个种群的分散后丰度,这些模拟的种群具有不同的初始丰度.

I'd like to repeat this many times, with constant probability matrix but with varying (pre-defined) initial abundances. The following small example achieves this, but is inefficient with larger problems. The resulting matrix gives the post-dispersal abundance in each of three populations for each of 5 simulations for which population had different initial abundances.

X <- matrix(sample(100, n*5, replace=TRUE), nrow=n)

apply(sapply(apply(X, 2, function(x) {
  lapply(seq_len(ncol(p)), function(i) {
    rmultinom(1, x[i], p[, i])  
  })
}), function(x) do.call(cbind, x), simplify='array'), 3, rowSums)

#      [,1] [,2] [,3] [,4] [,5]
# [1,]   79   67   45   28   74
# [2,]   92   99   40   19   52
# [3,]   51   45   16   21   35

有没有一种方法可以更好地向量化此问题?

Is there a way to better vectorise this problem?

推荐答案

这是RcppGSL多项式的实现.但是,它要求您独立安装gsl....可能不太实用.

This is a RcppGSL implementation of multi-multinomial. However, it requires you to install gsl independently....which may not be very practical.

// [[Rcpp::depends(RcppGSL)]]

#include <RcppGSL.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <unistd.h>            // getpid

Rcpp::IntegerVector rmn(unsigned int N, Rcpp::NumericVector p, gsl_rng* r){

    size_t K = p.size();

    Rcpp::IntegerVector x(K);
    gsl_ran_multinomial(r, K, N, p.begin(), (unsigned int *) x.begin());
    return x;             // return results vector
}

Rcpp::IntegerVector gsl_mmm_1(Rcpp::IntegerVector N, Rcpp::NumericMatrix P, gsl_rng* r){
    size_t K = N.size();
    int i;
    Rcpp::IntegerVector x(K);
    for(i=0; i<K; i++){
        x += rmn(N[i], P(Rcpp::_, i), r);
    }
    return x;
}

// [[Rcpp::export]]
Rcpp::IntegerMatrix gsl_mmm(Rcpp::IntegerMatrix X_, Rcpp::NumericMatrix P){
    int j;
    gsl_rng * r = gsl_rng_alloc (gsl_rng_mt19937);
    long seed = rand()/(((double)RAND_MAX + 1)/10000000) * getpid();
    gsl_rng_set (r, seed);
    Rcpp::IntegerMatrix X(X_.nrow(), X_.ncol());
    for(j=0; j<X.ncol(); j++){
        X(Rcpp::_, j) = gsl_mmm_1(X_(Rcpp::_,j), P, r);
    }
    gsl_rng_free (r);
    return X;
}

我还将它与纯R实现和jbaums的版本进行比较

I also compare it with a pure R implementation and jbaums's version

library(Rcpp)
library(microbenchmark)
sourceCpp("gsl.cpp")

P = matrix(c(c(0.1,0.2,0.7),c(0.3,0.3,0.4),c(0.5,0.3,0.2)),nc=3)
X = matrix(c(c(30,40,30),c(20,40,40)), nc=2)

mmm = function(X, P){
    n = ncol(X)
    p = nrow(X)
    Reduce("+", lapply(1:p, function(j) {
        Y = matrix(0,p,n)
        for(i in 1:n) Y[,i] = rmultinom(1, X[j,i], P[,j])
        Y
    }))
}

jbaums = function(X,P){
    apply(sapply(apply(X, 2, function(x) {
      lapply(seq_len(ncol(P)), function(i) {
        rmultinom(1, x[i], P[, i])
      })
    }), function(x) do.call(cbind, x), simplify='array'), nrow(X), rowSums)
}
microbenchmark(jbaums(X,P), mmm(X,P), gsl_mmm(X, P))

这就是结果

> microbenchmark(jbaums(X,P), mmm(X,P), gsl_mmm(X, P))
Unit: microseconds
          expr     min       lq  median       uq     max neval
  jbaums(X, P) 165.832 172.8420 179.185 187.2810 339.280   100
     mmm(X, P)  60.071  63.5955  67.437  71.5775  92.963   100
 gsl_mmm(X, P)  10.529  11.8800  13.671  14.6220  40.857   100

gsl版本比我的纯R版本快6倍.

The gsl version is about 6x faster than my pure R version.

这篇关于当样本大小和概率变化时进行有效的多项采样的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆