Quickest way to calculate subset of correlation matrix

Question

I'm partial to using pandas builtin corr method for dataframes. However, I am trying to calculate the correlation matrix of a dataframe with 45,000 columns. And then repeat this 250 times. The calculation is crushing my ram (16 GB, mac book pro). I'm grabbing statistics on the columns of the resulting correlation matrix. So I need one column's correlation with every other column to calculate those statistics. My solution is to calculate correlation of a subset of columns with every other column, but I need an efficient way to do this.

Consider:

import pandas as pd
import numpy as np

np.random.seed([3,1415])

df = pd.DataFrame(np.random.rand(6, 4), columns=list('ABCD'))
df

I want to calculate correlations for just ['A', 'B']:

corrs = df.corr()[['A', 'B']]
corrs

I'll finish it off by calculating the mean or some other stat.
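For instance, one such wrap-up statistic (a stand-in for whatever is actually needed) is the per-column mean of the correlations:

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(6, 4), columns=list('ABCD'))

# correlations of A and B with every column, then one number per column
corrs = df.corr()[['A', 'B']]
col_means = corrs.mean()
```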

I can't use the code I used to create the example because when I scale up, I don't have the memory for it. When performing the calculation, it must use an amount of memory proportional to the number of columns chosen to calculate correlations relative to everything else.
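A naive sketch of that access pattern, before any optimization: loop over one target column at a time with pandas' `corrwith`, reduce immediately, and keep only the per-column statistic. This is slow (it re-computes means and standard deviations on every call, which is exactly what the dot-product approach below avoids), but its memory use is proportional to a single column of the correlation matrix:

```python
import numpy as np
import pandas as pd

def column_corr_means(df):
    # Correlate each column against the whole frame and reduce right away,
    # so only one column of the correlation matrix exists at a time.
    return pd.Series({c: df.corrwith(df[c]).mean() for c in df.columns})

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(6, 4), columns=list('ABCD'))
means = column_corr_means(df)
```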

I'm looking for the most performant solution at scale. I have a solution, but I'm looking for other ideas to ensure I'm getting the best. Any answer provided that returns the correct answer as shown in the demonstration and satisfies the memory constraint will be upvoted by me (and I'd encourage upvoting amongst each other as well).

Below is my code:

def corr(df, k=0, l=10):
    d = df.values - df.values.mean(0)   # demean every column
    d_ = d[:, k:l]                      # the selected block of columns
    s = d.std(0, keepdims=True)         # per-column standard deviations
    return pd.DataFrame(d.T.dot(d_) / s.T.dot(s[:, k:l]) / d.shape[0],
                        df.columns, df.columns[k:l])

Answer

Using dot products to compute the correlation (as in your example) seems like a good approach. I'll describe two improvements, then code implementing them.

We can pull the means out of the dot product, to avoid having to subtract them from every value (similar to how you pulled the standard deviations out of the dot product, which we'll also do).

Let x, y be vectors with n elements. Let a, b be scalars. Let <x,y> denote the dot product between x and y.

The correlation between x and y can be expressed using the dot product:

<(x-mean(x))/std(x), (y-mean(y))/std(y)> / n

To pull the standard deviations out of the dot product, we can use the following identity (as you did above):

<ax, by> = a*b*<x, y>

To pull the means out of the dot product, we can derive another identity:

<x+a, y+b> = <x,y> + a*sum(y) + b*sum(x) + a*b*n

a = -mean(x), b = -mean(y)的情况下,简化为:

<x-mean(x), y-mean(y)> = <x, y> - sum(x)*sum(y)/n
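A quick numeric sanity check of this identity, using hypothetical random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(7)
y = rng.random(7)
n = x.size

# <x - mean(x), y - mean(y)>  ==  <x, y> - sum(x) * sum(y) / n
lhs = np.dot(x - x.mean(), y - y.mean())
rhs = np.dot(x, y) - x.sum() * y.sum() / n
```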

Using these identities, the correlation between x and y is equivalent to:

(<x, y> - sum(x)*sum(y)/n) / (std(x)*std(y)*n)
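This expression can be checked directly against NumPy's own correlation (note `np.std` defaults to the population convention, ddof=0, which is what the division by n assumes):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(50)
y = rng.random(50)
n = x.size

# (<x, y> - sum(x) * sum(y) / n) / (std(x) * std(y) * n)
r = (np.dot(x, y) - x.sum() * y.sum() / n) / (x.std() * y.std() * n)
```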

In the function below, this will be expressed using matrix multiplication and outer products to handle multiple variables simultaneously (as in your example).

We can pre-compute the sums and standard deviations, to avoid re-computing them for all columns every time the function is called.

Putting the two improvements together, we have the following (I don't speak pandas, so it's in numpy):

import numpy as np

def corr_cols(x, xsum, xstd, lo, hi):
    n = x.shape[0]

    return (
        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi])/n)
        / (np.outer(xstd, xstd[lo:hi])*n)
    )

# fake data w/ 10 points, 5 dimensions
x = np.random.rand(10, 5)

# precompute sums and standard deviations along each dimension
xsum = np.sum(x, 0)
xstd = np.std(x, 0)

# calculate columns of correlation matrix for dimensions 1 thru 3
r = corr_cols(x, xsum, xstd, 1, 4)

Better code

Pre-computing and storing the sums and standard deviations can be hidden inside a closure, to give a nicer interface and keep the main code cleaner. Functionally, the operations are equivalent to the previous code.

def col_correlator(x):
    n = x.shape[0]
    xsum = np.sum(x, 0)
    xstd = np.std(x, 0)

    return lambda lo, hi: (
        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi])/n)
        / (np.outer(xstd, xstd[lo:hi])*n)
    )

# construct function to compute columns of correlation matrix
cc = col_correlator(x)

# compute columns of correlation matrix for dimensions 1 thru 3
r = cc(1, 4)


(piRSquared)

I wanted to put my edit in this post to further encourage upvoting of this answer.

This is the code I implemented utilizing this advice. This solution translates back and forth between pandas and numpy.

def corr_closure(df):
    d = df.values
    sums = d.sum(0, keepdims=True)
    stds = d.std(0, keepdims=True)
    n = d.shape[0]

    def corr(k=0, l=10):
        d2 = d.T.dot(d[:, k:l])
        sums2 = sums.T.dot(sums[:, k:l])
        stds2 = stds.T.dot(stds[:, k:l])

        return pd.DataFrame((d2 - sums2 / n) / stds2 / n,
                            df.columns, df.columns[k:l])

    return corr

Usage:

corr = corr_closure(df)

corr(0, 2)
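As a sanity check, the closure's output matches pandas' full `corr` on the small example (the closure is repeated here so the snippet runs standalone):

```python
import numpy as np
import pandas as pd

def corr_closure(df):
    # same closure as above, repeated to keep this check self-contained
    d = df.values
    sums = d.sum(0, keepdims=True)
    stds = d.std(0, keepdims=True)
    n = d.shape[0]

    def corr(k=0, l=10):
        d2 = d.T.dot(d[:, k:l])
        sums2 = sums.T.dot(sums[:, k:l])
        stds2 = stds.T.dot(stds[:, k:l])
        return pd.DataFrame((d2 - sums2 / n) / stds2 / n,
                            df.columns, df.columns[k:l])

    return corr

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(6, 4), columns=list('ABCD'))

corr = corr_closure(df)
result = corr(0, 2)
```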
