当样本大小和概率变化时进行有效的多项采样 [英] Efficient multinomial sampling when sample size and probability vary
问题描述
这个问题涉及从具有不同样本大小和概率的多项式分布中进行有效采样.下面我将描述我使用的方法,但是想知道是否可以通过一些智能矢量化来改进它.
This question pertains to efficient sampling from multinomial distributions with varying sample sizes and probabilities. Below I describe the approach I have used, but wonder whether it can be improved with some intelligent vectorisation.
我正在模拟生物在多个种群中的扩散.人群j
中的个体以概率p[i, j]
分散到人群i
中.给定种群1的初始丰度为10,并且分别向种群1、2和3分散c(0.1, 0.3, 0.6)
的概率,我们可以使用rmultinom
模拟分散过程:
I'm simulating dispersal of organisms amongst multiple populations. Individuals from population j
disperse to population i
with probability p[i, j]
. Given an initial abundance of 10 in population 1, and probabilities of dispersal c(0.1, 0.3, 0.6)
to populations 1, 2, and 3, respectively, we can simulate the dispersal process with rmultinom
:
set.seed(1)
rmultinom(1, 10, c(0.1, 0.3, 0.6))
# [,1]
# [1,] 0
# [2,] 3
# [3,] 7
我们可以将其扩展为考虑n
个来源人群:
We can extend this to consider n
source populations:
set.seed(1)
n <- 3
p <- replicate(n, diff(c(0, sort(runif(n-1)), 1)))
X <- sample(100, n)
上面的
,p
是从一个总体(列)移动到另一总体(行)的概率矩阵,而X
是初始总体大小的向量.现在可以使用以下公式模拟散布在每对人口之间的个人数量(以及剩余的人口数量):
Above, p
is a matrix of probabilities of moving from one population (column) to another (row), and X
is a vector of initial population sizes. The number of individuals dispersing between each pair of populations (and those remaining where they are) can now be simulated with:
sapply(seq_len(ncol(p)), function(i) {
rmultinom(1, X[i], p[, i])
})
# [,1] [,2] [,3]
# [1,] 19 42 11
# [2,] 8 18 43
# [3,] 68 6 8
其中,在第i
行和第j
列的元素的值是从总体j
迁移到总体i
的个体数量.该矩阵的rowSums
给出了新的人口规模.
where the value of the element at the i
th row and j
th column is the number of individuals moving from population j
to population i
. The rowSums
of this matrix give the new population sizes.
我想用恒定的概率矩阵但是用变化的(预定义的)初始丰度重复多次.下面的小示例可以实现此目的,但是在遇到较大问题时效率不高.所得矩阵为5个模拟中的每个模拟提供了三个种群中每个种群的分散后丰度,这些模拟的种群具有不同的初始丰度.
I'd like to repeat this many times, with constant probability matrix but with varying (pre-defined) initial abundances. The following small example achieves this, but is inefficient with larger problems. The resulting matrix gives the post-dispersal abundance in each of three populations for each of 5 simulations for which population had different initial abundances.
X <- matrix(sample(100, n*5, replace=TRUE), nrow=n)
apply(sapply(apply(X, 2, function(x) {
lapply(seq_len(ncol(p)), function(i) {
rmultinom(1, x[i], p[, i])
})
}), function(x) do.call(cbind, x), simplify='array'), 3, rowSums)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 79 67 45 28 74
# [2,] 92 99 40 19 52
# [3,] 51 45 16 21 35
有没有一种方法可以更好地向量化此问题?
Is there a way to better vectorise this problem?
推荐答案
这是RcppGSL多项式的实现.但是,它要求您独立安装gsl....可能不太实用.
This is a RcppGSL implementation of multi-multinomial. However, it requires you to install gsl independently....which may not be very practical.
// [[Rcpp::depends(RcppGSL)]]
#include <RcppGSL.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <unistd.h> // getpid
Rcpp::IntegerVector rmn(unsigned int N, Rcpp::NumericVector p, gsl_rng* r){
size_t K = p.size();
Rcpp::IntegerVector x(K);
gsl_ran_multinomial(r, K, N, p.begin(), (unsigned int *) x.begin());
return x; // return results vector
}
Rcpp::IntegerVector gsl_mmm_1(Rcpp::IntegerVector N, Rcpp::NumericMatrix P, gsl_rng* r){
size_t K = N.size();
int i;
Rcpp::IntegerVector x(K);
for(i=0; i<K; i++){
x += rmn(N[i], P(Rcpp::_, i), r);
}
return x;
}
// [[Rcpp::export]]
Rcpp::IntegerMatrix gsl_mmm(Rcpp::IntegerMatrix X_, Rcpp::NumericMatrix P){
int j;
gsl_rng * r = gsl_rng_alloc (gsl_rng_mt19937);
long seed = rand()/(((double)RAND_MAX + 1)/10000000) * getpid();
gsl_rng_set (r, seed);
Rcpp::IntegerMatrix X(X_.nrow(), X_.ncol());
for(j=0; j<X.ncol(); j++){
X(Rcpp::_, j) = gsl_mmm_1(X_(Rcpp::_,j), P, r);
}
gsl_rng_free (r);
return X;
}
我还将它与纯R实现和jbaums的版本进行比较
I also compare it with a pure R implementation and jbaums's version
library(Rcpp)
library(microbenchmark)
sourceCpp("gsl.cpp")
P = matrix(c(c(0.1,0.2,0.7),c(0.3,0.3,0.4),c(0.5,0.3,0.2)),nc=3)
X = matrix(c(c(30,40,30),c(20,40,40)), nc=2)
mmm = function(X, P){
n = ncol(X)
p = nrow(X)
Reduce("+", lapply(1:p, function(j) {
Y = matrix(0,p,n)
for(i in 1:n) Y[,i] = rmultinom(1, X[j,i], P[,j])
Y
}))
}
jbaums = function(X,P){
apply(sapply(apply(X, 2, function(x) {
lapply(seq_len(ncol(P)), function(i) {
rmultinom(1, x[i], P[, i])
})
}), function(x) do.call(cbind, x), simplify='array'), nrow(X), rowSums)
}
microbenchmark(jbaums(X,P), mmm(X,P), gsl_mmm(X, P))
这就是结果
> microbenchmark(jbaums(X,P), mmm(X,P), gsl_mmm(X, P))
Unit: microseconds
expr min lq median uq max neval
jbaums(X, P) 165.832 172.8420 179.185 187.2810 339.280 100
mmm(X, P) 60.071 63.5955 67.437 71.5775 92.963 100
gsl_mmm(X, P) 10.529 11.8800 13.671 14.6220 40.857 100
gsl版本比我的纯R版本快6倍.
The gsl version is about 6x faster than my pure R version.
这篇关于当样本大小和概率变化时进行有效的多项采样的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!