Random numpy array whose values are between -1 and 1 and sum to 1


Problem description


What is the best way to create a NumPy array x of a given size with values randomly (and uniformly?) spread between -1 and 1, and that also sum to 1?

I tried 2*np.random.rand(size)-1 and np.random.uniform(-1,1,size) based on the discussion here, but if I take a transformation approach, re-scaling both methods by their sum afterwards with x/=np.sum(x), the elements do sum to 1, but then there are elements in the array that are suddenly much greater than 1 or less than -1 (>1, <-1), which is not wanted.
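To make the failure mode concrete, here is a minimal numpy sketch of that rescaling approach (the array length of 10 and the 1000 trials are arbitrary illustration choices): dividing by the sum forces the total to 1, but when the sum is close to 0 the individual entries blow up well outside [-1, 1].

import numpy as np

# Sketch of the rescaling approach from the question: the total becomes 1,
# but individual entries frequently land outside [-1, 1].
rng = np.random.default_rng()
violations = 0
for _ in range(1000):
    x = rng.uniform(-1, 1, 10)
    x = x / np.sum(x)              # sums to 1 by construction
    if np.any(np.abs(x) > 1):      # ... but entries may leave [-1, 1]
        violations += 1
print(f"{violations} of 1000 rescaled draws violate |x[i]| <= 1")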

Solution

Rejection sampling

You can use rejection sampling. The method below does this by sampling in a space of 1 dimension less than the original space.

  • Step 1: you sample x(1), x(2), ..., x(n-1) randomly by sampling each x(i) from a uniform distribution
  • Step 2: if the sum S = x(1) + x(2) + ... + x(n-1) is below 0 or above 2 then reject and start again in Step 1.
  • Step 3: compute the n-th variable as x(n) = 1-S

Intuition

You can view the vector x(1), x(2), ..., x(n-1), x(n) as a point in the interior of an n-dimensional cube whose corners have Cartesian coordinates (±1, ±1, ..., ±1), so that the constraints -1 <= x(i) <= 1 are satisfied.

The additional constraint that the sum of the coordinates must equal 1 restricts the coordinates to a space smaller than the hypercube: a hyperplane of dimension n-1.

If you do regular rejection sampling, drawing all n coordinates from the uniform distribution, you will never hit the constraint: the hyperplane has zero volume in n dimensions, so a sampled point will never lie exactly on it. Therefore you consider a subspace of only n-1 coordinates, and within that subspace rejection sampling works.

Visually

Say you have dimension 4; then you could plot 3 of the 4 coordinates. This plot (homogeneously) fills a polyhedron. Below this is illustrated by plotting the polyhedron in slices. Each slice corresponds to a different sum S = x(1) + x(2) + ... + x(n-1) and a different value for x(n).

Image: domain for 3 coordinates. Each colored surface relates to a different value for the 4-th coordinate.
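The original figure is not reproduced here; the following is a rough matplotlib sketch (not part of the original answer; the 5000 points and the default colour map are arbitrary choices) of how such a plot could be generated with the rejection sampler described above.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()

def sample4():
    while True:
        x = rng.uniform(-1, 1, 3)       # step 1: first three coordinates
        S = x.sum()
        if 0 <= S <= 2:                 # step 2: accept only if x(4) = 1 - S lies in [-1, 1]
            return np.append(x, 1 - S)  # step 3

pts = np.array([sample4() for _ in range(5000)])

ax = plt.figure().add_subplot(projection="3d")
sc = ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], c=pts[:, 3], s=2)
plt.colorbar(sc, ax=ax, label="x(4) = 1 - S")
ax.set_xlabel("x(1)"); ax.set_ylabel("x(2)"); ax.set_zlabel("x(3)")
plt.show()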

Marginal distributions

For large dimensions, rejection sampling will become less efficient because the fraction of rejections grows with the number of dimensions.

One way to 'solve' this would be by sampling from the marginal distributions. However, it is a bit tedious to compute these marginal distributions. Comparison: For generating samples from a Dirichlet distribution a similar algorithm exists, but in that case, the marginal distributions are relatively easy. (however, it is not impossible to derive these distributions, see below 'relationship with Irwin Hall distribution')
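As an aside on that comparison: NumPy ships a direct sampler for the Dirichlet distribution, shown below. This only illustrates the remark above and is not a solution to the question, since Dirichlet samples are non-negative rather than spread over [-1, 1].

import numpy as np

# Dirichlet samples sum to 1 by construction, but every entry lies in [0, 1].
y = np.random.dirichlet(alpha=np.ones(4), size=3)
print(y)
print(y.sum(axis=1))   # each row sums to 1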

In the example above the marginal distribution of the x(4) coordinate corresponds to the surface area of the cuts. So for 4 dimensions, you might be able to figure out the computation based on that figure (you'd need to compute the area of those irregular polygons) but it starts to get more complicated for larger dimensions.

Relationship with Irwin Hall distribution

To get the marginal distributions you can use truncated Irwin Hall distributions. The Irwin Hall distribution is the distribution of a sum of uniformly distributed variables and follows a piecewise polynomial shape. This is demonstrated below for one example.
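Since the question asks about Python, here is a small sketch of the Irwin Hall density in Python (it mirrors the R function dih further below; the truncation and rescaling needed for the actual marginals are handled separately in the R code).

from math import comb, factorial, floor

# Irwin Hall pdf: density of the sum of n independent Uniform(0, 1) variables,
# a piecewise polynomial on [0, n].
def irwin_hall_pdf(x, n=3):
    if x < 0 or x > n:
        return 0.0
    return sum((-1) ** k * comb(n, k) * (x - k) ** (n - 1)
               for k in range(floor(x) + 1)) / factorial(n - 1)

print(irwin_hall_pdf(1.5, n=3))   # 0.75, the peak of the n = 3 density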

Code

Since my python is rusty I will mostly add R code. The algorithm is very basic and so I imagine that any Python coder can easily adapt it into Python code. The hard part of the question seems to me to be more about the algorithm than about how to code in Python (although I am not a Python coder so I leave that up to others).

Image: output from sampling. The 4 black curves are marginal distributions for the four coordinates. The red curve is a computation based on an Irwin Hall distribution. This can be extended to a sampling method by computing directly instead of rejection sampling.

The rejection sampling in Python

import numpy as np

def sampler(size):
    reject = True
    while reject:
        x = np.random.uniform(-1, 1, size - 1)  # step 1: sample the first size-1 coordinates in [-1, 1]
        S = np.sum(x)
        reject = (S < 0) or (S > 2)             # step 2: reject if the last coordinate would fall outside [-1, 1]
    x = np.append(x, 1 - S)                     # step 3: the last coordinate makes the sum equal 1
    return x

y = sampler(5)
print(y, np.sum(y))

Some more code in R follows, including the comparison with the Irwin Hall distribution. This distribution can be used to compute the marginal distributions and to devise an algorithm that is more efficient than rejection sampling.

### function to do rejection sample
samp <- function(n) {
  S <- -1
  ## a while loop that performs step 1 (sample) and 2 (compare sum)
  while((S<0) || (S>2) ) { 
    x <- runif(n-1,-1,1)
    S <- sum(x)
  }
  x <- c(x,1-S) ## step 3 (generate n-th coordinate)
  x
}

### compute 10^5 samples
y <- replicate(10^5,samp(4))

### plot histograms
h1 <- hist(y[1,], breaks = seq(-1,1,0.05))
h2 <- hist(y[2,], breaks = seq(-1,1,0.05))
h3 <- hist(y[3,], breaks = seq(-1,1,0.05))
h4 <- hist(y[4,], breaks = seq(-1,1,0.05))

### histograms together in a line plot
plot(h1$mids,h1$density, type = 'l', ylim = c(0,1),
     xlab = "x[i]", ylab = "frequency", main = "marginal distributions")
lines(h2$mids,h2$density)
lines(h3$mids,h3$density)
lines(h4$mids,h4$density)

### add distribution based on Irwin Hall distribution

### Irwin Hall PDF
dih <- function(x,n=3) {
  k <- 0:(floor(x))   
  terms <- (-1)^k * choose(n,k) *(x-k)^(n-1)
  sum(terms)/prod(1:(n-1))
}
dih <- Vectorize(dih)

### Irwin Hall CDF
pih <- function(x,n=3) {
  k <- 0:(floor(x))   
  terms <- (-1)^k * choose(n,k) *(x-k)^n
  sum(terms)/prod(1:(n))
}
pih <- Vectorize(pih)


### adding the line 
### (note we need to scale the variable for the Irwin Hall distribution)
xn <- seq(-1,1,0.001)

range <- c(-1,1)
cum <- pih(1.5+(1-range)/2,3)
scale <- 0.5/(cum[1]-cum[2]) ### renormalize
                           ### (the factor 0.5 is due to the scale difference)
lines(xn,scale*dih(1.5+(1-xn)/2,3),col = 2)
