Finding the correlation matrix


Problem description

I have a matrix which is fairly large (around 50K rows), and I want to print the correlation coefficient between each row in the matrix. I have written Python code like this:

import scipy.stats

for i in range(rows):  # rows is the number of rows in the matrix
    for j in range(i, rows):
        r = scipy.stats.pearsonr(data[i, :], data[j, :])  # returns a (r, p-value) tuple
        print(r)

Please note that I am making use of the pearsonr function available from the scipy module (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).

My question is: Is there a quicker way of doing this? Is there some matrix partition technique that I can use?

Thanks!

Recommended answer

New solution

After looking at Joe Kington's answer, I decided to look into the corrcoef() code and was inspired by it to do the following implementation.

import numpy as np

ms = data.mean(axis=1)[:, None]             # row means, kept 2-D for broadcasting
datam = data - ms                           # center each row
datass = np.sqrt(np.sum(datam**2, axis=1))  # row norms (scipy.stats.ss was removed from SciPy)
for i in range(rows):
    temp = np.dot(datam[i:], datam[i])      # dot products of row i with rows i..end
    rs = temp / (datass[i:] * datass[i])    # Pearson r between row i and rows i..end

Each pass through the loop generates the Pearson coefficients between row i and rows i through the last row. It is very fast: at least 1.5x as fast as using corrcoef() alone, because it doesn't redundantly calculate the coefficients and a few other things. It also avoids the memory problems of a 50,000-row matrix, because you can choose to either store each set of r's or process them before generating another set. Without storing any of the r's long term, I was able to run the above code on a 50,000 x 10 set of randomly generated data in under a minute on my fairly new laptop.
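As a minimal sketch of that streaming idea (my own illustration, not part of the original answer), the loop can keep only the k strongest off-diagonal correlations with a heap instead of storing all ~1.25 billion values; it assumes rows, datam, and datass are defined as in the snippet above:

import heapq
import numpy as np

k = 10
top = []  # min-heap of (r, i, j) tuples
for i in range(rows):
    # Pearson r between row i and rows i..end, as in the loop above.
    rs = np.dot(datam[i:], datam[i]) / (datass[i:] * datass[i])
    for offset in np.argsort(rs)[-(k + 1):]:  # largest candidates in this block
        j = i + offset
        if i == j:
            continue  # skip the trivial r(i, i) == 1
        item = (rs[offset], i, j)
        if len(top) < k:
            heapq.heappush(top, item)
        elif item > top[0]:
            heapq.heapreplace(top, item)
# top now holds the k strongest correlations seen

This keeps peak memory at O(rows + k) rather than O(rows^2).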

Old solution

First, I wouldn't recommend printing the r's to the screen. For 100 rows (10 columns), your code takes 19.79 seconds with printing vs. 0.301 seconds without it. Just store the r's and use them later if you would like, or do some processing on them as you go, such as looking for some of the largest r's.

Second, you can get some savings by not redundantly calculating some quantities. The Pearson coefficient is calculated in scipy using some quantities that you can precompute once, rather than recalculating every time a row is used. Also, you aren't using the p-value (which is also returned by pearsonr()), so let's scratch that too. Using the code below:

import numpy as np

r = np.zeros((rows, rows))
ms = data.mean(axis=1)

# Precompute the centered rows and their sums of squares once.
datam = data - ms[:, None]
datass = np.sum(datam**2, axis=1)  # sums of squares (scipy.stats.ss was removed from SciPy)
for i in range(rows):
    for j in range(i, rows):
        r_num = np.add.reduce(datam[i] * datam[j])
        r_den = np.sqrt(datass[i] * datass[j])
        r[i, j] = min(r_num / r_den, 1.0)  # clamp floating-point overshoot

When I removed the p-value computation, I got a speed-up of about 4.8x over the straight scipy code, and 8.8x if I left the p-value computation in (I used 10 columns with hundreds of rows). I also checked that it gives the same results. This isn't a huge improvement, but it might help.
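For reference, here is one way to reproduce that same-results check (a sketch of my own, assuming data and the r matrix from the code above are in scope); np.corrcoef() treats each row as a variable:

import numpy as np

expected = np.corrcoef(data)              # full symmetric correlation matrix
assert np.allclose(r, np.triu(expected))  # the code above fills only the upper triangle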

Ultimately, you are stuck with the problem that you are computing (50000)*(50001)/2 = 1,250,025,000 Pearson coefficients (if I'm counting correctly). That's a lot. By the way, there's really no need to compute each row's Pearson coefficient with itself (it will equal 1), but that only saves you from computing 50,000 of them. Based on my results on smaller datasets, I expect the above code would take about 4 1/4 hours for your computation if your data has 10 columns.

You can get some improvement by porting the above code to Cython or something similar. I expect that you'll maybe get up to a 10x improvement over straight Scipy if you're lucky. Also, as suggested by pyInTheSky, you can do some multiprocessing.
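A rough sketch of that multiprocessing idea (my own illustration, not pyInTheSky's code) farms blocks of starting rows out to a Pool; it assumes the fork start method, so the workers inherit the precomputed arrays:

import numpy as np
from multiprocessing import Pool

def row_block(i):
    # Pearson r between row i and rows i..end, using the precomputed
    # centered rows and norms defined below (inherited by the workers
    # under the fork start method; on spawn platforms, pass them via
    # a Pool initializer instead).
    return datam[i:] @ datam[i] / (datass[i:] * datass[i])

if __name__ == "__main__":
    data = np.random.rand(50_000, 10)
    datam = data - data.mean(axis=1)[:, None]
    datass = np.sqrt(np.sum(datam**2, axis=1))
    with Pool() as pool:
        for rs in pool.imap(row_block, range(len(data)), chunksize=256):
            pass  # process each block of r's here instead of storing them all

Each worker reuses the centered rows and norms computed once in the parent, so the per-task work is just one matrix-vector product.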
