Can I create a multivariate_normal matrix using dask?
Question
Somewhat related to this post, I am trying to replicate multivariate_normal in dask.
Using numpy, I can create a multivariate normal matrix with a specified covariance using:
import numpy as np
n_dim = 5
size = 300
A = np.random.randn(n_dim, n_dim) # a matrix
covm = A.dot(A.T) # A*A^T is positive semi-definite, as a covariance matrix
x = np.random.multivariate_normal(mean=np.zeros(n_dim), cov=covm, size=size) # generate data
However, I need a significantly larger matrix, with n_dim = 4_500_000 and size = 100000. This will be expensive to compute with respect to both CPU and memory. Fortunately, I have access to a Cloudera Data Science Workbench cluster and was trying to solve this using dask:
import dask.array as da
n_dim = 4_500_000
size = 100000
A = da.random.standard_normal((n_dim, n_dim))
covm = A.dot(A.T)
#x = da.random.multivariate_normal(size=300, mean=np.zeros(len(covm)),cov=covm) # generate data
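For a sense of scale, the sizes above can be turned into a rough memory estimate. This is only back-of-the-envelope arithmetic, assuming dense float64 storage (8 bytes per element) and decimal terabytes; it says nothing about how dask would actually chunk or spill the arrays.

```python
# Rough memory footprint of the dense arrays, assuming float64 storage.
n_dim = 4_500_000
size = 100_000
bytes_per_float = 8  # assumption: float64

cov_bytes = n_dim * n_dim * bytes_per_float    # covariance matrix, n_dim x n_dim
sample_bytes = size * n_dim * bytes_per_float  # sample matrix, size x n_dim

print(f"covariance matrix: {cov_bytes / 1e12:.1f} TB")
print(f"sample matrix:     {sample_bytes / 1e12:.1f} TB")
```

The covariance matrix alone is far larger than the samples, so it dominates any plan for chunking the computation.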
In the documentation, I cannot find any function that seems to do what I need. Does anyone know a solution or workaround, possibly using xarray or any other module that runs on clusters?
Recommended answer
A workaround for now is to use a Cholesky decomposition. Any positive-definite covariance matrix C can be expressed as C = G*G', where G is lower triangular. It then follows that x = G*y is correlated as specified in C if y is standard normal, since Cov(x) = G*Cov(y)*G' = G*G' = C (see this excellent post on Mathematics Stack Exchange). In code:
Numpy
n_dim = 4
size = 100000
A = np.random.randn(n_dim, n_dim)
covm = A.dot(A.T)
x = np.random.multivariate_normal(size=size, mean=np.zeros(len(covm)), cov=covm)
## verify that numpy's sample covariance is close to covm
np.cov(x, rowvar=False)
covm
Dask
## create covariance matrix
A = da.random.standard_normal(size=(n_dim, n_dim), chunks=(2, 2))
covm = A.dot(A.T)
## get the lower-triangular Cholesky factor
L = da.linalg.cholesky(covm, lower=True)
## draw standard normal samples
sn = da.random.standard_normal(size=(size, n_dim), chunks=(100, 100))
## correct for correlation
x = L.dot(sn.T)
x.shape  # (n_dim, size): one row per variable
## verify: the sample covariance should be close to covm
covm.compute()
da.cov(x, rowvar=True).compute()
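The steps above can be wrapped into a small helper that also handles a non-zero mean. This is a minimal sketch, not part of dask: the name dask_multivariate_normal and the chunk_rows parameter are made up for illustration, and it assumes the covariance matrix is positive definite and small enough to fit in a single chunk, which would not hold for the very large n_dim in the question.

```python
import numpy as np
import dask.array as da


def dask_multivariate_normal(mean, cov, size, chunk_rows=10_000):
    """Draw `size` samples from N(mean, cov) as a dask array of shape (size, n_dim).

    Hypothetical helper (not a dask API). Assumes `cov` is positive definite
    and fits in one chunk.
    """
    mean = np.asarray(mean, dtype=float)
    n_dim = mean.shape[0]
    # Put the covariance in a single square chunk before factorising it.
    cov = da.asarray(cov).rechunk((n_dim, n_dim))
    # Lower-triangular factor with cov = L @ L.T
    L = da.linalg.cholesky(cov, lower=True)
    # Independent standard normal draws, chunked along the sample axis only.
    z = da.random.standard_normal(
        size=(size, n_dim), chunks=(min(chunk_rows, size), n_dim)
    )
    # Each row y of z becomes L @ y (written as y @ L.T), then is shifted by mean.
    return z.dot(L.T) + mean
```

Calling dask_multivariate_normal(np.zeros(4), covm, size=100000) returns a lazy array with one sample per row; nothing is evaluated until .compute() is called, and np.cov(result, rowvar=False) on the computed samples should be close to covm.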