Compute the mean and the covariance of a large matrix (300000 x 70000)

Question

I am using NumPy and trying to compute the mean and the covariance of a large matrix (300000 x 70000). I have 32 GB of memory available. What is the best practice for this task in terms of computational efficiency and ease of implementation?
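For scale, here is a rough size estimate. It assumes 8-byte `float64` entries, which the question does not state; with `float32` every figure halves.

```python
import numpy as np

n_rows, n_cols = 300_000, 70_000
itemsize = np.dtype(np.float64).itemsize  # 8 bytes (assumed dtype)

full_matrix_gb = n_rows * n_cols * itemsize / 1e9   # the data itself
cov_matrix_gb = n_cols * n_cols * itemsize / 1e9    # dense covariance matrix
mean_vector_mb = n_cols * itemsize / 1e6            # one value per column

print(f"data matrix:       {full_matrix_gb:.1f} GB")   # 168.0 GB
print(f"covariance matrix: {cov_matrix_gb:.1f} GB")    # 39.2 GB
print(f"mean vector:       {mean_vector_mb:.2f} MB")   # 0.56 MB
```

Under that assumption neither the data nor a full dense covariance matrix fits in 32 GB, which is why the matrix has to be processed in chunks.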

My current implementation is as follows:

import numpy as np

def compute_mean_variance(mat, chunk_size):
    row_count = mat.row_count
    col_count = mat.col_count
    # Maintain running per-column sums `x_sum` and `x2_sum`:
    #   mean(x) = x_sum / row_count
    #   var(x)  = x2_sum / row_count - mean(x)**2
    x_sum = np.zeros([1, col_count])
    x2_sum = np.zeros([1, col_count])

    for i in range(0, row_count, chunk_size):
        sub_mat = mat[i:i+chunk_size, :]
        # Load an in-memory sub_mat of size chunk_size x col_count
        sub_mat = sub_mat.read().val
        x_sum += np.sum(sub_mat, 0)
        # Accumulate the per-column sum of squares
        x2_sum += np.sum(sub_mat**2, 0)
    x_mean = x_sum / row_count
    x_var = x2_sum / row_count - x_mean ** 2
    return x_mean, x_var

Any suggestions for improvement?
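The function above tracks per-column variances only. A minimal sketch of extending the same running-sum idea to the covariance matrix might look like the following. The `chunks` iterable is a hypothetical stand-in for reading row blocks of the on-disk matrix, and the small random array is only there to check the result. Note that a dense 70000 x 70000 accumulator in `float64` needs roughly 39 GB, so at the question's scale this would only work per block of columns or in lower precision.

```python
import numpy as np

def chunked_mean_cov(chunks, n_cols):
    """Accumulate column means and the covariance matrix over row chunks.

    `chunks` is any iterable of 2-D arrays with `n_cols` columns
    (a hypothetical stand-in for blocks read from disk).
    """
    n = 0
    x_sum = np.zeros(n_cols)
    xtx = np.zeros((n_cols, n_cols))  # running sum of x^T x
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        n += chunk.shape[0]
        x_sum += chunk.sum(axis=0)
        xtx += chunk.T @ chunk
    mean = x_sum / n
    # Population covariance: E[x x^T] - mean mean^T
    cov = xtx / n - np.outer(mean, mean)
    return mean, cov

# Small demonstration on synthetic data, read 128 rows at a time.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))
row_blocks = (data[i:i + 128] for i in range(0, data.shape[0], 128))
mean, cov = chunked_mean_cov(row_blocks, n_cols=5)
```

This divides by the row count, matching the population formula used in the question's code; it corresponds to `np.cov(data, rowvar=False, bias=True)`.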

I find the following implementation easier to understand. It also uses NumPy to calculate the mean and standard deviation for chunks of columns, so it should be more efficient and numerically stable.

import numpy as np

def compute_mean_std(mat, chunk_size):
    row_count = mat.row_count
    col_count = mat.col_count
    mean = np.zeros(col_count)
    std = np.zeros(col_count)

    # Process the matrix one block of columns at a time
    for i in range(0, col_count, chunk_size):
        sub_mat = mat[:, i : i + chunk_size]
        # Load an in-memory sub_mat of size row_count x chunk_size
        sub_mat = sub_mat.read().val
        mean[i : i + chunk_size] = np.mean(sub_mat, axis=0)
        std[i : i + chunk_size] = np.std(sub_mat, axis=0)

    return mean, std
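If reading by rows is preferable to reading by columns, numerical stability can still be kept by merging per-chunk statistics instead of accumulating raw sums of squares. The sketch below uses the pairwise update usually attributed to Chan, Golub and LeVeque; it is a swapped-in technique, not part of the original post, and `merge_stats` is a hypothetical helper name.

```python
import numpy as np

def merge_stats(n_a, mean_a, m2_a, chunk):
    """Fold one row chunk into running (count, mean, M2) statistics.

    M2 is the per-column sum of squared deviations from the mean.
    """
    n_b = chunk.shape[0]
    mean_b = chunk.mean(axis=0)
    m2_b = ((chunk - mean_b) ** 2).sum(axis=0)
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * (n_b / n)
    m2 = m2_a + m2_b + delta ** 2 * (n_a * n_b / n)
    return n, mean, m2

# Synthetic data with a large offset, where sum-of-squares formulas lose precision.
rng = np.random.default_rng(1)
data = 1e6 + rng.normal(size=(1000, 4))

n, mean, m2 = 0, np.zeros(4), np.zeros(4)
for start in range(0, data.shape[0], 100):
    n, mean, m2 = merge_stats(n, mean, m2, data[start:start + 100])
var = m2 / n          # population variance
std = np.sqrt(var)
```

Only one chunk and three small vectors are held in memory at a time, and the large common offset never gets squared.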

Recommended answer

So I had a project at university that involved time-complexity tests for different matrix multiplication algorithms.

I've uploaded the source code here.

One of the optimizations I found was that you can speed up array access by restructuring your for-loops to work through one row at a time rather than traversing columns. This is due to the way caches exploit spatial locality (i.e. your computer is optimized for accessing elements that sit side by side in memory, which in a row-major 2D array, as in C or NumPy's default layout, are the elements of a single row, rather than jumping from row to row).
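A small illustration of that layout in NumPy, using a toy 3 x 4 array (the array and variable names are only for demonstration):

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)  # C order (row-major) by default

print(a.strides)                      # (32, 8): next row is 32 bytes away, next column 8
print(a[0, :].flags["C_CONTIGUOUS"])  # True: a row is one contiguous run of memory
print(a[:, 0].flags["C_CONTIGUOUS"])  # False: a column is strided across rows

# A Fortran-ordered copy flips this, making column slices contiguous.
f = np.asfortranarray(a)
print(f.strides)                      # (8, 24)
print(f[:, 0].flags["C_CONTIGUOUS"])  # True
```

So whether row chunks or column chunks are the cache-friendly choice depends on how the matrix is actually stored.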

Also, if these are 'sparse' matrices (many zero elements), you can change the data structure to record only the non-zero elements.

Obviously, if you're given a normal (dense) matrix, the cost of converting it to a sparse matrix is probably not worth it, but I just thought these observations were worth sharing :)
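To make the sparse idea concrete, here is a minimal sketch that stores only the non-zero entries as three coordinate arrays and computes per-column mean and variance from them. The tiny 4 x 3 example and all variable names are assumptions for illustration; a real project would more likely use a dedicated sparse-matrix library.

```python
import numpy as np

# Coordinate (COO-style) storage: only the non-zero entries are kept.
n_rows, n_cols = 4, 3
rows = np.array([0, 1, 3, 3])          # row index of each non-zero
cols = np.array([0, 2, 0, 1])          # column index of each non-zero
vals = np.array([2.0, 5.0, 4.0, 1.0])  # the non-zero values

# Column sums and sums of squares touch only the stored entries;
# the implicit zeros contribute nothing to either sum.
col_sum = np.bincount(cols, weights=vals, minlength=n_cols)
col_sq_sum = np.bincount(cols, weights=vals ** 2, minlength=n_cols)

mean = col_sum / n_rows
var = col_sq_sum / n_rows - mean ** 2

# Cross-check against the dense equivalent.
dense = np.zeros((n_rows, n_cols))
dense[rows, cols] = vals
```

The work scales with the number of non-zero entries rather than with the full 300000 x 70000 grid, which is where the saving would come from if the data really is sparse.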
