Efficiency issues with finding correlations between lists inside lists
Question
If I have two small lists and want to find the correlation between each list inside list1 and each list inside list2, I can do this:
from scipy.stats import pearsonr
list1 = [[1,2,3],[4,5,6],[7,8,9],[10,11,12]]
list2 = [[10,20,30],[40,50,60],[77,78,79],[80,78,56]]
corrVal = []
for i in list1:
    for j in list2:
        corrVal.append(pearsonr(i,j)[0])
print(corrVal)
OUTPUT: [1.0, 1.0, 1.0, -0.90112711377916588, 1.0, 1.0, 1.0, -0.90112711377916588, 1.0, 1.0, 1.0, -0.90112711377916588, 1.0, 1.0, 1.0, -0.90112711377916588]
Which works out fine... just about. (Edit: I just noticed that the correlation outputs above seem to give the correct answer, but they repeat 4 times. Not exactly sure why it's doing that.)
However, for larger datasets with thousands of values in the lists, my code freezes indefinitely and outputs no errors, so I have to force quit my IDE every time. Any ideas where I'm slipping up here? I'm not sure whether there is an inherent limit to how much the pearsonr function can handle, or whether my code is causing the problem.
Recommended answer
The SciPy module scipy.spatial.distance includes a distance function known as Pearson's distance (SciPy calls it the correlation distance), which is simply 1 minus the correlation coefficient. By passing the argument metric='correlation' to scipy.spatial.distance.cdist, you can efficiently compute Pearson's correlation coefficient for each pair of vectors in two inputs.
Here's an example. I'll modify your data so the coefficients are more varied:
In [96]: list1 = [[1, 2, 3.5], [4, 5, 6], [7, 8, 12], [10, 7, 10]]
In [97]: list2 = [[10, 20, 30], [41, 51, 60], [77, 80, 79], [80, 78, 56]]
So we know what to expect, here are the correlation coefficients computed using scipy.stats.pearsonr:
In [98]: [pearsonr(x, y)[0] for x in list1 for y in list2]
Out[98]:
[0.99339926779878296,
0.98945694873927104,
0.56362148019067804,
-0.94491118252306794,
1.0,
0.99953863896044937,
0.65465367070797709,
-0.90112711377916588,
0.94491118252306805,
0.93453339271427294,
0.37115374447904509,
-0.99339926779878274,
0.0,
-0.030372836961539348,
-0.7559289460184544,
-0.43355498476205995]
It is more convenient to see those in an array:
In [99]: np.array([pearsonr(x, y)[0] for x in list1 for y in list2]).reshape(len(list1), len(list2))
Out[99]:
array([[ 0.99339927, 0.98945695, 0.56362148, -0.94491118],
[ 1. , 0.99953864, 0.65465367, -0.90112711],
[ 0.94491118, 0.93453339, 0.37115374, -0.99339927],
[ 0. , -0.03037284, -0.75592895, -0.43355498]])
Here's the same result computed using cdist:
In [100]: from scipy.spatial.distance import cdist
In [101]: 1 - cdist(list1, list2, metric='correlation')
Out[101]:
array([[ 0.99339927, 0.98945695, 0.56362148, -0.94491118],
[ 1. , 0.99953864, 0.65465367, -0.90112711],
[ 0.94491118, 0.93453339, 0.37115374, -0.99339927],
[ 0. , -0.03037284, -0.75592895, -0.43355498]])
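As a rough sketch of what metric='correlation' works out to, the same matrix can be built by hand with NumPy: center each row, scale it to unit length, and take dot products. The function name pairwise_pearson is made up for this illustration and is not part of SciPy. It also assumes no row is constant, since a constant row has zero variance and would lead to a division by zero.

```python
import numpy as np
from scipy.spatial.distance import cdist


def pairwise_pearson(a, b):
    """Pearson correlation of every row of a with every row of b.

    Hypothetical helper for illustration only (not a SciPy function).
    Assumes every row has non-zero variance.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Subtract each row's mean so the rows are centered.
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    # Scale each centered row to unit Euclidean length.
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    # Entry (i, j) is the dot product of row i of a with row j of b.
    return a @ b.T


list1 = [[1, 2, 3.5], [4, 5, 6], [7, 8, 12], [10, 7, 10]]
list2 = [[10, 20, 30], [41, 51, 60], [77, 80, 79], [80, 78, 56]]

by_hand = pairwise_pearson(list1, list2)
via_cdist = 1 - cdist(list1, list2, metric='correlation')
# Compare the hand-rolled matrix with the cdist-based one.
print(np.allclose(by_hand, via_cdist))
```

Written this way, the whole computation is a handful of array operations plus one matrix product, rather than one Python-level function call per pair of rows.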
Using cdist is much faster than calling pearsonr in a nested loop. Here I'll use two arrays, data1 and data2, each with shape (100, 10000):
In [102]: data1 = np.random.randn(100, 10000)
In [103]: data2 = np.random.randn(100, 10000)
I'll use the convenient %timeit command in IPython to measure the execution time:
In [104]: %timeit c1 = [pearsonr(x, y)[0] for x in data1 for y in data2]
1 loop, best of 3: 836 ms per loop
In [105]: %timeit c2 = 1 - cdist(data1, data2, metric='correlation')
100 loops, best of 3: 4.35 ms per loop
That's 836 ms for the nested loop, and 4.35 ms for cdist.
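%timeit only works inside IPython. Outside it, a comparable measurement could be made with the standard-library timeit module, roughly as sketched below. The array shape here is deliberately smaller than in the run above so the script finishes quickly, and the helper names nested_loop and with_cdist are made up for the illustration. Timings depend on the machine and library versions, so no expected numbers are given.

```python
import timeit

import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Smaller than the (100, 10000) arrays above, purely to keep the run short.
data1 = rng.standard_normal((20, 500))
data2 = rng.standard_normal((20, 500))


def nested_loop():
    # One pearsonr call per pair of rows.
    return [pearsonr(x, y)[0] for x in data1 for y in data2]


def with_cdist():
    # A single vectorized call covering all pairs.
    return 1 - cdist(data1, data2, metric='correlation')


# Check that both approaches agree before timing them.
c1 = np.array(nested_loop()).reshape(len(data1), len(data2))
c2 = with_cdist()
print("results match:", np.allclose(c1, c2))

# Take the best of three single runs for each approach.
loop_time = min(timeit.repeat(nested_loop, number=1, repeat=3))
cdist_time = min(timeit.repeat(with_cdist, number=1, repeat=3))
print(f"nested loop: {loop_time * 1e3:.2f} ms")
print(f"cdist:       {cdist_time * 1e3:.2f} ms")
```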