Efficient pairwise correlation for two matrices of features

Problem description

In Python, I need to find the pairwise correlation between all features in a matrix A and all features in a matrix B. In particular, I am interested in finding the strongest Pearson correlation that a given feature in A has across all features in B. I do not care whether the strongest correlation is positive or negative.

I've written an inefficient implementation using two loops and scipy below. However, I'd like to use np.corrcoef or a similar method to compute it efficiently. Matrix A has shape 40000x400 and B has shape 40000x1440. My attempt at doing it efficiently can be seen below as the method find_max_absolute_corr(A,B). However, it fails with the following error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly.

import numpy as np
from scipy.stats import pearsonr


def find_max_absolute_corr(A, B):
    """ Finds for each feature in `A` the highest Pearson
        correlation across all features in `B`. """

    max_corr_A = np.zeros((A.shape[1]))    

    for A_col in range(A.shape[1]):
        print("Calculating {}/{}.".format(A_col+1, A.shape[1]))

        metric = A[:,A_col]
        pearson = np.corrcoef(B, metric, rowvar=0)

        # takes negative correlations into account as well
        min_p = min(pearson)
        max_p = max(pearson)
        max_corr_A[A_col] = max_absolute(min_p, max_p)

    return max_corr_A


def max_absolute(min_p, max_p):
    if np.isnan(min_p) or np.isnan(max_p):
        raise ValueError("NaN correlation.")
    if abs(max_p) > abs(min_p):
        return max_p
    else:
        return min_p


if __name__ == '__main__':

    A = np.array(
        [[10, 8.04, 9.14, 7.46],
         [8, 6.95, 8.14, 6.77],
         [13, 7.58, 8.74, 12.74],
         [9, 8.81, 8.77, 7.11],
         [11, 8.33, 9.26, 7.81]])

    B = np.array(
        [[-14, -9.96, 8.10, 8.84, 8, 7.04], 
         [-6, -7.24, 6.13, 6.08, 5, 5.25], 
         [-4, -4.26, 3.10, 5.39, 8, 5.56], 
         [-12, -10.84, 9.13, 8.15, 5, 7.91], 
         [-7, -4.82, 7.26, 6.42, 8, 6.89]])

    # simple, inefficient method
    for A_col in range(A.shape[1]): 
        high_corr = 0
        for B_col in range(B.shape[1]):
            corr,_ = pearsonr(A[:,A_col], B[:,B_col])
            high_corr = max_absolute(high_corr, corr)
        print(high_corr)

    # -0.161314601631
    # 0.956781516149
    # 0.621071009239
    # -0.421539304112        

    # efficient method
    max_corr_A = find_max_absolute_corr(A, B)
    print(max_corr_A)

    # [-0.161314601631,
    # 0.956781516149,
    # 0.621071009239,
    # -0.421539304112]  

Solution

It seems scipy.stats.pearsonr follows this definition of the Pearson correlation coefficient formula, applied to column-wise pairs from A and B -
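The formula referred to here is presumably the computational (raw-sums) form of the Pearson coefficient, since that is the form the four parts p1 to p4 in the code correspond to. Written out for one column x of A and one column y of B, each with N rows (the symbols x, y and r are notation introduced here, not taken from the original answer):

```latex
r_{xy} =
\frac{N \sum_{i=1}^{N} x_i y_i \;-\; \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}
     {\sqrt{\Bigl(N \sum_{i=1}^{N} x_i^2 - \bigl(\sum_{i=1}^{N} x_i\bigr)^2\Bigr)
            \Bigl(N \sum_{i=1}^{N} y_i^2 - \bigl(\sum_{i=1}^{N} y_i\bigr)^2\Bigr)}}
```

In the code, p1 is the first term of the numerator, p2 the second, p4 the bracket for the column of A, and p3 the bracket for the column of B.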

Based on that formula, you can vectorize this easily, as the pairwise computations on columns from A and B are independent of each other. Here's one vectorized solution using broadcasting -

# Get number of rows in either A or B
N = B.shape[0]

# Store column-wise sums of A and B, as they are used in a few places
sA = A.sum(0)
sB = B.sum(0)

# There are four parts in the formula; compute them one by one
p1 = N*np.einsum('ij,ik->kj',A,B)
p2 = sA*sB[:,None]
p3 = N*((B**2).sum(0)) - (sB**2)
p4 = N*((A**2).sum(0)) - (sA**2)

# Finally compute Pearson Correlation Coefficient as 2D array 
pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))

# Get the element corresponding to absolute argmax along the columns 
out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]
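Since the question asked about np.corrcoef specifically, here is a minimal alternative sketch, not part of the original answer. It assumes that holding a (p+q) x (p+q) correlation matrix in memory is acceptable (1840 x 1840 for the shapes in the question). The function name max_abs_corr_via_corrcoef is made up for illustration. With rowvar=False, np.corrcoef treats columns as variables and stacks the columns of A and B, so the A-versus-B correlations sit in the upper-right block.

```python
import numpy as np


def max_abs_corr_via_corrcoef(A, B):
    """For each column of A, return the correlation with the column of B
    that has the largest absolute Pearson correlation (sign preserved)."""
    p = A.shape[1]
    # Full correlation matrix over the p columns of A followed by the q columns of B.
    full = np.corrcoef(A, B, rowvar=False)
    # Upper-right block: rows index columns of A, columns index columns of B.
    cross = full[:p, p:]
    # Position of the largest absolute correlation in each row, ignoring NaNs.
    idx = np.nanargmax(np.abs(cross), axis=1)
    return cross[np.arange(p), idx]


A = np.array(
    [[10, 8.04, 9.14, 7.46],
     [8, 6.95, 8.14, 6.77],
     [13, 7.58, 8.74, 12.74],
     [9, 8.81, 8.77, 7.11],
     [11, 8.33, 9.26, 7.81]])

B = np.array(
    [[-14, -9.96, 8.10, 8.84, 8, 7.04],
     [-6, -7.24, 6.13, 6.08, 5, 5.25],
     [-4, -4.26, 3.10, 5.39, 8, 5.56],
     [-12, -10.84, 9.13, 8.15, 5, 7.91],
     [-7, -4.82, 7.26, 6.42, 8, 6.89]])

result = max_abs_corr_via_corrcoef(A, B)
print(result)
```

This also computes the A-versus-A and B-versus-B blocks, which are discarded, so the broadcasting solution does less work when only the cross-correlations are needed.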

Sample run -

1) Inputs :

In [12]: A
Out[12]: 
array([[ 10.  ,   8.04,   9.14,   7.46],
       [  8.  ,   6.95,   8.14,   6.77],
       [ 13.  ,   7.58,   8.74,  12.74],
       [  9.  ,   8.81,   8.77,   7.11],
       [ 11.  ,   8.33,   9.26,   7.81]])

In [13]: B
Out[13]: 
array([[-14.  ,  -9.96,   8.1 ,   8.84,   8.  ,   7.04],
       [ -6.  ,  -7.24,   6.13,   6.08,   5.  ,   5.25],
       [ -4.  ,  -4.26,   3.1 ,   5.39,   8.  ,   5.56],
       [-12.  , -10.84,   9.13,   8.15,   5.  ,   7.91],
       [ -7.  ,  -4.82,   7.26,   6.42,   8.  ,   6.89]])

2) Original loop-based code run -

In [14]: high_corr_out = np.zeros(A.shape[1])
    ...: for A_col in range(A.shape[1]): 
    ...:     high_corr = 0
    ...:     for B_col in range(B.shape[1]):
    ...:         corr,_ = pearsonr(A[:,A_col], B[:,B_col])
    ...:         high_corr = max_absolute(high_corr, corr)
    ...:     high_corr_out[A_col] = high_corr
    ...:     

In [15]: high_corr_out
Out[15]: array([ 0.8067843 ,  0.95678152,  0.74016181, -0.85127779])

3) Proposed code run -

In [16]: N = B.shape[0]
    ...: sA = A.sum(0)
    ...: sB = B.sum(0)
    ...: p1 = N*np.einsum('ij,ik->kj',A,B)
    ...: p2 = sA*sB[:,None]
    ...: p3 = N*((B**2).sum(0)) - (sB**2)
    ...: p4 = N*((A**2).sum(0)) - (sA**2)
    ...: pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))
    ...: out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]
    ...: 

In [17]: pcorr # Pearson Correlation Coefficient array
Out[17]: 
array([[ 0.41895565, -0.5910935 , -0.40465987,  0.5818286 ],
       [ 0.66609445, -0.41950457,  0.02450215,  0.64028344],
       [-0.64953314,  0.65669916,  0.30836196, -0.85127779],
       [-0.41917583,  0.59043266,  0.40364532, -0.58144102],
       [ 0.8067843 ,  0.07947386,  0.74016181,  0.53165395],
       [-0.1613146 ,  0.95678152,  0.62107101, -0.4215393 ]])

In [18]: out # elements corresponding to absolute argmax along columns
Out[18]: array([ 0.8067843 ,  0.95678152,  0.74016181, -0.85127779])
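The shape of pcorr above (6 x 4: one row per column of B, one column per column of A) follows from how the einsum and broadcasting steps are written. A small check on random data, offered as an illustration rather than as part of the original answer:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 4))   # 5 rows, 4 features
B = rng.random((5, 6))   # 5 rows, 6 features

# 'ij,ik->kj' sums over the shared row index i, giving one entry per
# (column of B, column of A) pair; this is the same as B.T @ A.
p1 = np.einsum('ij,ik->kj', A, B)
same_as_matmul = np.allclose(p1, B.T @ A)

# sB[:, None] has shape (6, 1); multiplying by sA of shape (4,) broadcasts
# to (6, 4), i.e. the outer product of the two column-sum vectors.
sA = A.sum(0)
sB = B.sum(0)
p2 = sA * sB[:, None]
same_as_outer = np.allclose(p2, np.outer(sB, sA))

print(p1.shape, p2.shape, same_as_matmul, same_as_outer)
```

Because rows of pcorr index columns of B, the argmax is taken along axis=0 to get one value per column of A.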

Runtime tests -

In [36]: A = np.random.rand(4000,40)

In [37]: B = np.random.rand(4000,144)

In [38]: np.allclose(org_app(A,B),proposed_app(A,B))
Out[38]: True

In [39]: %timeit org_app(A,B) # Original approach
1 loops, best of 3: 1.35 s per loop

In [40]: %timeit proposed_app(A,B) # Proposed vectorized approach
10 loops, best of 3: 39.1 ms per loop
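The answer does not show org_app and proposed_app. Presumably they wrap the loop-based code and the vectorized code respectively. A sketch of what they might look like, under that assumption. To keep the sketch NumPy-only, np.corrcoef on a single pair of columns stands in for scipy.stats.pearsonr here.

```python
import numpy as np


def org_app(A, B):
    """Loop-based version: one Pearson correlation per (A column, B column) pair."""
    out = np.zeros(A.shape[1])
    for a_col in range(A.shape[1]):
        best = 0.0
        for b_col in range(B.shape[1]):
            # np.corrcoef on two 1-D arrays returns a 2x2 matrix; [0, 1] is r.
            corr = np.corrcoef(A[:, a_col], B[:, b_col])[0, 1]
            if abs(corr) > abs(best):
                best = corr
        out[a_col] = best
    return out


def proposed_app(A, B):
    """Vectorized version, following the four-part formula from the answer."""
    N = B.shape[0]
    sA = A.sum(0)
    sB = B.sum(0)
    p1 = N * np.einsum('ij,ik->kj', A, B)
    p2 = sA * sB[:, None]
    p3 = N * ((B ** 2).sum(0)) - (sB ** 2)
    p4 = N * ((A ** 2).sum(0)) - (sA ** 2)
    pcorr = (p1 - p2) / np.sqrt(p4 * p3[:, None])
    return pcorr[np.nanargmax(np.abs(pcorr), axis=0), np.arange(pcorr.shape[1])]


# Smaller shapes than in the timing run, just to check that the two agree.
rng = np.random.default_rng(0)
A = rng.random((200, 5))
B = rng.random((200, 7))
agree = np.allclose(org_app(A, B), proposed_app(A, B))
print(agree)
```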
