Cosine distance computation between two arrays - Python

Question

I want to apply a function fn, which is essentially a cosine distance computation, row-wise to two large numpy arrays of shapes (10000, 100) and (5000, 100), i.e. I calculate a value for each combination of rows from these arrays.

My implementation:

import math
def fn(v1,v2):
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)
val = []
for i in range(array1.shape[0]):
    for j in range(array2.shape[0]):
        val.append(fn(array1[i, :], array2[j, :]))

The function itself is very fast and takes only a few ms:

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 1.24 ms

Is there an efficient way to do this across all row combinations?

Recommended answer

Approach #1 : We could simply use SciPy's cdist with its cosine metric. cdist returns the cosine distance, so subtracting it from 1 gives the similarity value that fn computes -

from scipy.spatial.distance import cdist

val_out = 1 - cdist(array1, array2, 'cosine')
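Approach #2 : The runtime test further down calls a function named cosine_vectorized, whose code is not shown here. What follows is a minimal sketch of what such a function could look like, assuming it computes the squared row sums with plain NumPy and the cross terms with one matrix multiplication. The function body is an assumption made for illustration, not code taken from the answer.

```python
import numpy as np

def cosine_vectorized(array1, array2):
    # Assumed implementation: squared row norms via plain NumPy sums
    sumyy = (array2 ** 2).sum(axis=1)                 # shape (n2,)
    sumxx = (array1 ** 2).sum(axis=1, keepdims=True)  # shape (n1, 1)
    # All pairwise dot products in a single matrix multiplication
    sumxy = array1.dot(array2.T)                      # shape (n1, n2)
    # Broadcasting divides row i by |array1[i]| and column j by |array2[j]|
    return (sumxy / np.sqrt(sumxx)) / np.sqrt(sumyy)

# Tiny usage example with hand-checkable vectors
a = np.array([[1.0, 0.0], [1.0, 1.0]])
b = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sim = cosine_vectorized(a, b)  # one value per (row of a, row of b) pair
```

Entry sim[i, j] holds the value fn would return for a[i] and b[j], so the nested loop from the question is replaced by a single matrix product plus two broadcasted divisions.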

Approach #3 : Using np.einsum to compute the self squared summations, as another option -

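import numpy as np

def cosine_vectorized_v2(array1, array2):
    # Row-wise squared norms: einsum multiplies elementwise and sums over j
    sumyy = np.einsum('ij,ij->i',array2,array2)
    sumxx = np.einsum('ij,ij->i',array1,array1)[:,None]
    # Dot products between every row of array1 and every row of array2
    sumxy = array1.dot(array2.T)
    return (sumxy/np.sqrt(sumxx))/np.sqrt(sumyy)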
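As a sanity check, the einsum version can be compared against the looped fn from the question on small inputs. This is only a sketch: the array sizes and the random seed are arbitrary choices made for this example, and the names small1, small2, looped and vectorized are hypothetical.

```python
import math
import numpy as np

def fn(v1, v2):
    # Looped cosine similarity of two 1-D vectors, as in the question
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    return sumxy / math.sqrt(sumxx * sumyy)

def cosine_vectorized_v2(array1, array2):
    # einsum-based version from the answer
    sumyy = np.einsum('ij,ij->i', array2, array2)
    sumxx = np.einsum('ij,ij->i', array1, array1)[:, None]
    sumxy = array1.dot(array2.T)
    return (sumxy / np.sqrt(sumxx)) / np.sqrt(sumyy)

# Small arrays (sizes are an assumption) so the double loop stays cheap
rng = np.random.default_rng(0)
small1 = rng.random((6, 4))
small2 = rng.random((3, 4))

looped = np.array([[fn(r1, r2) for r2 in small2] for r1 in small1])
vectorized = cosine_vectorized_v2(small1, small2)
matches = np.allclose(looped, vectorized)
```

The flat list val built in the question follows the same row-major order, so it corresponds to vectorized.ravel().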

Approach #4 : Bringing in the numexpr module to offload the square-root computations and the final division, as another method -

import numpy as np
import numexpr as ne

def cosine_vectorized_v3(array1, array2):
    sumyy = np.einsum('ij,ij->i',array2,array2)
    sumxx = np.einsum('ij,ij->i',array1,array1)[:,None]
    sumxy = array1.dot(array2.T)
    # numexpr evaluates the expression strings using the local variables
    sqrt_sumxx = ne.evaluate('sqrt(sumxx)')
    sqrt_sumyy = ne.evaluate('sqrt(sumyy)')
    return ne.evaluate('(sumxy/sqrt_sumxx)/sqrt_sumyy')


Runtime test

# Using same sizes as stated in the question
In [185]: array1 = np.random.rand(10000,100)
     ...: array2 = np.random.rand(5000,100)
     ...: 

In [194]: %timeit 1 - cdist(array1, array2, 'cosine')
1 loops, best of 3: 366 ms per loop

In [195]: %timeit cosine_vectorized(array1, array2)
1 loops, best of 3: 287 ms per loop

In [196]: %timeit cosine_vectorized_v2(array1, array2)
1 loops, best of 3: 283 ms per loop

In [197]: %timeit cosine_vectorized_v3(array1, array2)
1 loops, best of 3: 217 ms per loop
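Outside IPython, the %timeit lines above can be approximated with the standard timeit module. The sketch below times only the einsum version and uses smaller arrays than the question (an assumption made here to keep the run short), so its numbers are not comparable with the figures listed above.

```python
import timeit
import numpy as np

def cosine_vectorized_v2(array1, array2):
    # einsum-based version from the answer
    sumyy = np.einsum('ij,ij->i', array2, array2)
    sumxx = np.einsum('ij,ij->i', array1, array1)[:, None]
    sumxy = array1.dot(array2.T)
    return (sumxy / np.sqrt(sumxx)) / np.sqrt(sumyy)

# Reduced sizes (assumption); the question uses (10000, 100) and (5000, 100)
array1 = np.random.rand(1000, 100)
array2 = np.random.rand(500, 100)

# repeat=3 with number=1 mirrors "1 loops, best of 3"
times = timeit.repeat(lambda: cosine_vectorized_v2(array1, array2),
                      repeat=3, number=1)
best = min(times)
print(f"best of 3: {best * 1e3:.2f} ms per loop")
```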
