Efficiency issues with finding correlations between lists inside lists


Question


If I have two small lists and I want to find the correlation between each list inside list1 with each list inside list2, I can do this

from scipy.stats import pearsonr

list1 = [[1,2,3],[4,5,6],[7,8,9],[10,11,12]]
list2 = [[10,20,30],[40,50,60],[77,78,79],[80,78,56]]

corrVal = []
for i in list1:
    for j in list2:
        corrVal.append(pearsonr(i,j)[0])

print(corrVal)

OUTPUT: [1.0, 1.0, 1.0, -0.90112711377916588, 1.0, 1.0, 1.0, -0.90112711377916588, 1.0, 1.0, 1.0, -0.90112711377916588, 1.0, 1.0, 1.0, -0.90112711377916588]


Which works out fine... just about. (Just noticed my correlation outputs above seem to give the correct answer, but they repeat 4 times. Not exactly sure why it's doing that.)
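The repetition is actually expected with this particular data: every row of list1 is just [1, 2, 3] shifted by a constant, and Pearson's r is invariant under shifting and positive scaling, so each row of list1 produces the same four coefficients against list2. A quick sketch to confirm (the rounding is only to make the tuples comparable):

```python
from scipy.stats import pearsonr

list1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
list2 = [[10, 20, 30], [40, 50, 60], [77, 78, 79], [80, 78, 56]]

# Pearson's r is unchanged by adding a constant or scaling by a positive
# factor, so rows that are linear transforms of each other correlate
# identically with any given y.
rows = [tuple(round(pearsonr(x, y)[0], 10) for y in list2) for x in list1]
assert all(r == rows[0] for r in rows)  # every row yields the same 4-tuple
print(rows[0])
```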


However for larger datasets with 1000s of values in the lists, my code freezes indefinitely, outputting no errors, hence making me force quit my IDE every time. Any ideas where I'm slipping up here? Not sure whether there is an inherent limit to how much the pearsonr function can handle or whether my coding is causing the problem.

Answer


The scipy module scipy.spatial.distance includes a distance function known as Pearson's distance, which is simply 1 minus the correlation coefficient. By using the argument metric='correlation' in scipy.spatial.distance.cdist, you can efficiently compute Pearson's correlation coefficient for each pair of vectors in two inputs.


Here's an example. I'll modify your data so the coefficients are more varied:

In [96]: list1 = [[1, 2, 3.5], [4, 5, 6], [7, 8, 12], [10, 7, 10]]

In [97]: list2 = [[10, 20, 30], [41, 51, 60], [77, 80, 79], [80, 78, 56]]


So we know what to expect, here are the correlation coefficients computed using scipy.stats.pearsonr:

In [98]: [pearsonr(x, y)[0] for x in list1 for y in list2]
Out[98]: 
[0.99339926779878296,
 0.98945694873927104,
 0.56362148019067804,
 -0.94491118252306794,
 1.0,
 0.99953863896044937,
 0.65465367070797709,
 -0.90112711377916588,
 0.94491118252306805,
 0.93453339271427294,
 0.37115374447904509,
 -0.99339926779878274,
 0.0,
 -0.030372836961539348,
 -0.7559289460184544,
 -0.43355498476205995]


It is more convenient to see those in an array:

In [99]: np.array([pearsonr(x, y)[0] for x in list1 for y in list2]).reshape(len(list1), len(list2))
Out[99]: 
array([[ 0.99339927,  0.98945695,  0.56362148, -0.94491118],
       [ 1.        ,  0.99953864,  0.65465367, -0.90112711],
       [ 0.94491118,  0.93453339,  0.37115374, -0.99339927],
       [ 0.        , -0.03037284, -0.75592895, -0.43355498]])


Here's the same result computed using cdist:

In [100]: from scipy.spatial.distance import cdist

In [101]: 1 - cdist(list1, list2, metric='correlation')
Out[101]: 
array([[ 0.99339927,  0.98945695,  0.56362148, -0.94491118],
       [ 1.        ,  0.99953864,  0.65465367, -0.90112711],
       [ 0.94491118,  0.93453339,  0.37115374, -0.99339927],
       [ 0.        , -0.03037284, -0.75592895, -0.43355498]])


Using cdist is much faster than calling pearsonr in a nested loop. Here I'll use two arrays, data1 and data2, each with size (100, 10000):

In [102]: data1 = np.random.randn(100, 10000)

In [103]: data2 = np.random.randn(100, 10000)


I'll use the convenient %timeit command in ipython to measure the execution time:

In [104]: %timeit c1 = [pearsonr(x, y)[0] for x in data1 for y in data2]
1 loop, best of 3: 836 ms per loop

In [105]: %timeit c2 = 1 - cdist(data1, data2, metric='correlation')
100 loops, best of 3: 4.35 ms per loop


That's 836 ms for the nested loop, and 4.35 ms for cdist.
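If you'd rather not depend on scipy.spatial.distance, the same pairwise coefficients can be computed with plain NumPy: standardize each row (zero mean, unit norm) and take one matrix product. This is a sketch of that equivalence, not part of the original answer, and the function name is my own:

```python
import numpy as np

def pairwise_pearson(a, b):
    """Pearson r between every row of a and every row of b, via one matmul."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Center each row, then scale to unit norm; the dot product of two
    # standardized rows is exactly their Pearson correlation coefficient.
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

list1 = [[1, 2, 3.5], [4, 5, 6], [7, 8, 12], [10, 7, 10]]
list2 = [[10, 20, 30], [41, 51, 60], [77, 80, 79], [80, 78, 56]]
r = pairwise_pearson(list1, list2)
print(np.round(r, 8))
```

This should reproduce the 4x4 array shown above, and like cdist it runs as a single vectorized operation rather than a Python-level nested loop.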

