Calculate special correlation distance matrix faster

Problem description

I would like to build a distance matrix using the Pearson correlation distance. I first tried scipy.spatial.distance.pdist(df, 'correlation'), which is very fast on my dataset of 5000 rows * 20 features.

Since I want to build a recommender, I would like to change the distance slightly so that, for each pair of users, only the features that are not NaN for both users are considered. As it stands, scipy.spatial.distance.pdist(df, 'correlation') outputs NaN whenever it meets a feature whose value is float('nan').
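A small illustration of that behaviour, assuming a made-up 3 x 4 array (the values are arbitrary and not taken from the dataset): every pair that involves the row with a missing value should come back as NaN, while the pair of complete rows gets a finite distance.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical toy data: only row 2 has a missing value.
x = np.array([
    [1.0, 2.0, 3.0, 5.0],
    [2.0, 1.0, 4.0, 3.0],
    [1.0, np.nan, 2.0, 5.0],
])

# Condensed output order for 3 rows: pairs (0, 1), (0, 2), (1, 2)
d = pdist(x, "correlation")
nan_pairs = np.isnan(d)
print(nan_pairs)
```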

Here is my code, df being my 5000*20 pandas DataFrame:

import math
import scipy.spatial.distance

dist_mat = []
d = df.shape[1]
for i, row_i in enumerate(df.itertuples()):
    for j, row_j in enumerate(df.itertuples()):
        if i < j:
            print(i, j)
            # positions (offset by 1 to skip the index) where both rows have a value
            ind = [t + 1 for t in range(d)
                   if not (math.isnan(row_i[t + 1]) or math.isnan(row_j[t + 1]))]
            dist_mat.append(scipy.spatial.distance.correlation(
                [row_i[t] for t in ind], [row_j[t] for t in ind]))

This code works, but it is astonishingly slow compared to scipy.spatial.distance.pdist(df, 'correlation'). My question is: how can I improve my code so that it runs a lot faster? Or where can I find a library that calculates the correlation between two vectors while only taking into account the features that appear in both of them?

Thanks for your answers.

Recommended answer

I think you need to do this with Cython. Here is an example:

#cython: boundscheck=False, wraparound=False, cdivision=True

import numpy as np

cdef extern from "math.h":
    bint isnan(double x)
    double sqrt(double x)

def pair_correlation(double[:, ::1] x):
    cdef double[:, ::] res = np.empty((x.shape[0], x.shape[0]))
    cdef double u, v
    cdef int i, j, k, count
    cdef double du, dv, d, n, r
    cdef double um, vm  # per-pair means, typed explicitly instead of relying on type inference
    cdef double sum_u, sum_v, sum_u2, sum_v2, sum_uv

    for i in range(x.shape[0]):
        for j in range(i, x.shape[0]):
            sum_u = sum_v = sum_u2 = sum_v2 = sum_uv = 0.0
            count = 0            
            for k in range(x.shape[1]):
                u = x[i, k]
                v = x[j, k]
                if u == u and v == v:  # NaN != NaN, so this keeps only features present in both rows
                    sum_u += u
                    sum_v += v
                    sum_u2 += u*u
                    sum_v2 += v*v
                    sum_uv += u*v
                    count += 1
            if count == 0:
                res[i, j] = res[j, i] = -9999  # sentinel: rows i and j share no feature
                continue

            um = sum_u / count
            vm = sum_v / count
            n = sum_uv - sum_u * vm - sum_v * um + um * vm * count
            du = sqrt(sum_u2 - 2 * sum_u * um + um * um * count) 
            dv = sqrt(sum_v2 - 2 * sum_v * vm + vm * vm * count)
            r = 1 - n / (du * dv)
            res[i, j] = res[j, i] = r
    return res.base
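If compiling Cython is not convenient, the same running-sums formula can be sketched in plain NumPy by turning the inner loop over k into matrix products against a validity mask. This is only a sketch under stated assumptions: the name pair_correlation_numpy is made up here and is not part of the original answer, and the function allocates several n x n temporaries, so it may use noticeably more memory than the Cython loop.

```python
import numpy as np

def pair_correlation_numpy(x):
    """Correlation distance for every pair of rows, using only jointly non-NaN features."""
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(x)            # True where a value is present
    m = valid.astype(float)
    x0 = np.where(valid, x, 0.0)    # NaN replaced by 0 so it adds nothing to the sums

    count = m @ m.T                 # number of features shared by rows i and j
    sum_u = x0 @ m.T                # sum of row i over the features shared with row j
    sum_v = sum_u.T                 # sum of row j over the same features
    sum_u2 = (x0 * x0) @ m.T
    sum_v2 = sum_u2.T
    sum_uv = x0 @ x0.T

    # Same algebra as the Cython loop, applied to whole matrices at once.
    with np.errstate(divide="ignore", invalid="ignore"):
        um = sum_u / count
        vm = sum_v / count
        n = sum_uv - sum_u * vm - sum_v * um + um * vm * count
        du = np.sqrt(sum_u2 - 2 * sum_u * um + um * um * count)
        dv = np.sqrt(sum_v2 - 2 * sum_v * vm + vm * vm * count)
        res = 1 - n / (du * dv)
    res[count == 0] = -9999         # same sentinel as the Cython version
    return res
```

pandas documents DataFrame.corr as excluding NA values pairwise, so 1 - df.T.corr() may be worth trying as another cross-check; that is a suggestion, not something the original answer tested.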

To check the output without NaN:

import numpy as np
from scipy.spatial.distance import pdist, squareform, correlation
x = np.random.rand(2000, 20)
np.allclose(pair_correlation(x), squareform(pdist(x, "correlation")))

To check the output with NaN:

x = np.random.rand(2000, 20)
x[x < 0.3] = np.nan
r = pair_correlation(x)

i, j = 200, 60 # change this
mask = ~(np.isnan(x[i]) | np.isnan(x[j]))
u = x[i, mask]
v = x[j, mask]
assert abs(correlation(u, v) - r[i, j]) < 1e-12
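The assertion above checks one pair at a time. To compare every pair at once, a slow pure-NumPy reference could be used. reference_distance is a hypothetical helper, not part of SciPy or of the original answer, and it leaves pairs with fewer than two shared features as NaN instead of using the -9999 sentinel, so only the finite entries should be compared.

```python
import numpy as np

def reference_distance(x):
    """Slow reference: correlation distance over jointly non-NaN features, pair by pair."""
    n = x.shape[0]
    out = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i, n):
            mask = ~(np.isnan(x[i]) | np.isnan(x[j]))
            if mask.sum() < 2:
                continue  # too few shared features for a correlation; left as NaN
            u, v = x[i, mask], x[j, mask]
            out[i, j] = out[j, i] = 1 - np.corrcoef(u, v)[0, 1]
    return out

rng = np.random.default_rng(0)
x = rng.random((100, 20))
x[x < 0.3] = np.nan
ref = reference_distance(x)

# With the compiled Cython function available, the comparison could look like:
# r = pair_correlation(x)
# ok = np.isfinite(ref)
# print(np.allclose(r[ok], ref[ok]))
```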
