Vectorizing the multivariate normal CDF (cumulative density function) in Python

Problem description

How can I vectorize the multivariate normal CDF (cumulative density function) in Python?

When looking at this post, I found out that there is a Fortran implementation of the multivariate CDF that was "ported" over to Python. This means I can easily evaluate the CDF for one specific case.

However, I'm having a lot of trouble efficiently applying this function to multiple entries.

Specifically, the function I need to "vectorize" takes 4 arguments (a single-call sketch follows the list):

  • the lower bounds of integration (a vector)
  • the upper bounds of integration (a vector)
  • the means of the normal random variables (a vector)
  • the covariance matrix of the normal random variables (a matrix)
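
For reference, a single evaluation with these four arguments looks like this (the numbers are made up purely for illustration, and scipy.stats.mvn is an undocumented module that may be missing from recent SciPy releases):

import numpy as np
from scipy.stats.mvn import mvnun  # same import as in the script below

# Made-up 2-D example: P(0 < X1 < 1, 0 < X2 < 1) for X ~ N(mean, cov)
lower = np.array([0.0, 0.0])   # lower integration limits (vector)
upper = np.array([1.0, 1.0])   # upper integration limits (vector)
mean  = np.array([0.5, 0.5])   # means of the normal random variables (vector)
cov   = np.array([[1.0, 0.3],
                  [0.3, 1.0]]) # covariance matrix (matrix)

p, info = mvnun(lower, upper, mean, cov)  # returns (probability, error indicator)
print(p)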

But I'm trying to efficiently evaluate this function over a list of 1000+ elements MANY times.

Here is some code to illustrate my problem. In the example below I'm just using random data to illustrate my point.

import time
import numpy as np
from scipy.stats.mvn import mvnun # library that calculates MVN CDF

np.random.seed(666)

iters = 1000 # number of times the whole dataset will be evaluated
obs = 1500 # number of elements in the dataset
dim = 2 # dimension of multivariate normal distribution

lower = np.random.rand(obs,dim)
upper = lower + np.random.rand(obs,dim)
means = np.random.rand(obs,dim)

# Creates a symmetric matrix - used for the random covariance matrices
def make_sym_matrix(dim,vals):
    m = np.zeros([dim,dim])
    xs,ys = np.triu_indices(dim,k=1)
    m[xs,ys] = vals[:-dim]
    m[ys,xs] = vals[:-dim]
    m[ np.diag_indices(dim) ] = vals[-dim:]
    return m

# Generating the random covariance matrices
covs = []
for i in range(obs):
    cov_vals = np.random.rand(int((dim**2 + dim)/2))
    cov_mtx = make_sym_matrix(dim,cov_vals)
    covs.append(cov_mtx)
covs = np.array(covs)

# Here is where the trouble starts.
time_start = time.time()
for i in range(iters):
    results = []
    for j in range(obs):
        this_p, this_i = mvnun(lower[j],upper[j],means[j],covs[j])
        results.append(this_p)
time_end = time.time()

print(time_end-time_start)

Here I have a dataset with 1500 observations which I'm evaluating 1000 times. On my machine, this takes 6.74399995804 seconds to calculate.

Note that I'm not trying to get rid of the outer for-loop (over i). I just created that to mimic my real problem. The for-loop that I'm really trying to eliminate is the internal one (over j).

The execution time could be vastly reduced if I found a way to efficiently evaluate the CDF over the whole dataset.

I know that the mvnun function was originally written in Fortran (original code here) and "ported" to Python using f2py, as can be seen here.

Can anyone help me out with this? I've started looking at theano, but it seems that the only option I have there is to use the scan function, which might not be much of an improvement either.

Thanks!!!

Answer

This is only a partial answer, but there is a way to increase speed if the dimension of the multivariate normal distribution is small (2 or 3) and if the covariance matrix stays the same.

import numpy as np
import openturns as ot

def computeRectangularDomainProbability(lower, upper, means, cov_matrix):
    """
    Compute the probability of a rectangular solid
    under a multinormal distribution.
    
    """
    # Center the bounds of the rectangular solid on the mean
    # (make copies so the caller's arrays are not modified in place)
    lower = lower - means
    upper = upper - means
    
    # The same covariance matrix for all rectangular solids.
    cov_matrix = ot.CovarianceMatrix(cov_matrix)
    
    # This way, we only need to define one multivariate normal distribution.
    # That is the trick that allows vectorization.
    dimension = lower.shape[1]   
    multinormal = ot.Normal([0.0] * dimension, cov_matrix)
    
    # The probability of the rectangular solid is a weighted sum
    # of the CDF of the vertices (with weights equal to 1 or -1).
    # The following block computes the CDFs and applies the correct weight.
    full_reverse_binary = np.array(list(bin(2**dimension)[:1:-1]), dtype=int)
    prob = 0.0
    for i in range(2**dimension):
        reverse_binary = np.array(list(bin(i)[:1:-1]), dtype=int)
        reverse_binary = np.append(reverse_binary, 
                                   np.zeros(len(full_reverse_binary) - 
                                            len(reverse_binary) -
                                            1)).astype(int)
        point = np.zeros(lower.shape)
        for num, digit in enumerate(reverse_binary):
            if digit:
                point[:, num] = upper[:, num]
            else:
                point[:, num] = lower[:, num]
        cdf = np.array(multinormal.computeCDF(point))
        if (reverse_binary.sum() % 2) == (dimension % 2):
            prob += cdf
        else:
            prob -= cdf
    
    return prob.reshape(-1,)
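
To make the vertex weighting explicit: in two dimensions the loop above reduces to the usual inclusion-exclusion identity P(a1 < X1 < b1, a2 < X2 < b2) = F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2), where F is the joint CDF. Here is a small sketch of that 2-D case, using scipy.stats.multivariate_normal purely for illustration (it is not part of the function above):

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-D check of the inclusion-exclusion identity that the
# vertex loop above generalizes to any dimension.
mean = np.zeros(2)
cov = np.array([[1.0, 0.4],
                [0.4, 1.0]])
mvn = multivariate_normal(mean, cov)

a = np.array([-0.5, -0.2])   # lower corner of the rectangle
b = np.array([ 1.0,  0.8])   # upper corner of the rectangle

rect_prob = (mvn.cdf(b)                # F(b1, b2)
             - mvn.cdf([a[0], b[1]])   # F(a1, b2)
             - mvn.cdf([b[0], a[1]])   # F(b1, a2)
             + mvn.cdf(a))             # F(a1, a2)
print(rect_prob)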

Test script: dimension 2

iters = 10 # loop size
obs = 1500 # number of rectangular solids
dim = 2 # dimension of multivariate normal distribution

import time
import numpy as np
from scipy.stats.mvn import mvnun # library that calculates MVN CDF
from sklearn.datasets import make_spd_matrix
import openturns as ot

time_mvnun = 0.0
time_openturns = 0.0
discrepancy = 0.0
np.random.seed(0)

for iteration in range(iters):

    lower = np.random.rand(obs,dim)
    upper = lower + np.random.rand(obs,dim)
    means = np.random.rand(obs,dim)
    
    # Generating the random covariance matrices with sklearn
    # to make sure they are positive semi-definite        
    cov_mtx = make_spd_matrix(dim)
    
    
    # For a single evaluation, mvnun is about 100 times faster than openturns
    
    time_start = time.time()
    results = []
    for j in range(obs):
        this_p, this_i = mvnun(lower[j],upper[j],means[j],cov_mtx)
        results.append(this_p)
    results = np.array(results)
    time_end = time.time()
    time_mvnun += time_end - time_start
    
    
    time_start = time.time()       
    otparallel = computeRectangularDomainProbability(lower, upper, means, cov_mtx)
    time_end = time.time()
    time_openturns += time_end - time_start
    
    mvnorm_vs_otparallel = np.abs(results - otparallel).sum()
    discrepancy += mvnorm_vs_otparallel

print('Dimension {}'.format(dim))

# Print computation time
print('mvnun     time: {0:e}'.format(time_mvnun))
print('openturns time: {0:e}'.format(time_openturns))
print('ratio mvnun/ot: {0:f}'.format(time_mvnun / time_openturns))

# Check that the results are the same for mvnum and openturns
print('mvnun-openturns result discrepancy: {0:e}'.format(discrepancy))

Output on my machine:

Dimension 2
mvnun     time: 4.040635e+00
openturns time: 3.588211e+00
ratio mvnun/ot: 1.126086
mvnun-openturns result discrepancy: 8.057912e-11

There is a slight speedup: a little more than 10%.

Let us change the global variables controlling the script.

iters = 100 # loop size
obs = 1500 # number of rectangular solids
dim = 3 # dimension of multivariate normal distribution

Output on my machine:

Dimension 3
mvnun     time: 2.378337e+01
openturns time: 1.596872e+00
ratio mvnun/ot: 14.893725
mvnun-openturns result discrepancy: 4.537064e-03

The gain is much more significant in dimension 3: the proposed code is 15 times faster.

Unfortunately, openturns slows down a lot in dimension 4. It contains a smart implementation of the CDF for dimensions 1, 2 and 3, but falls back on a slower, more generic implementation for dimensions greater than 3.

iters = 1 # loop size
obs = 15 # number of rectangular solids
dim = 4 # dimension of multivariate normal distribution

Dimension 4
mvnun     time: 7.289171e-03
openturns time: 3.689714e+01
ratio mvnun/ot: 0.000198
mvnun-openturns result discrepancy: 6.297527e-07

In dimension 4, the proposed code is slower by about 4 orders of magnitude! This is probably because in dimension 4, it needs to compute 16=2^4 CDFs per rectangular solid, and each of these computations is slower than in smaller dimensions.
