Python Optimized Most Cosine Similar Vector

Question

I have about 30,000 vectors, and each vector has about 300 elements.

Given another vector with the same number of elements, how can I efficiently find the vector that is most similar to it by cosine similarity?

The following is one implementation using a Python loop:

from time import time
import numpy as np

vectors = np.load("np_array_of_about_30000_vectors.npy")
target = np.load("single_vector.npy")
print(vectors.shape, vectors.dtype)  # (35196, 312) float32
print(target.shape, target.dtype)  # (312,) float32

# Start below any possible cosine similarity (the minimum is -1).
max_similarity = -np.inf
max_index = -1

start_time = time()
for i, candidate in enumerate(vectors):
    # Cosine similarity: dot product divided by the product of the norms.
    similarity = np.dot(candidate, target) / (np.linalg.norm(candidate) * np.linalg.norm(target))
    if similarity > max_similarity:
        max_similarity = similarity
        max_index = i
print("done with loop in %s seconds" % (time() - start_time))  # 0.466356039047 seconds
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))  # index 2399 with 0.772758982696

The following version, with the Python loop removed, is 44x faster, but it isn't the same computation:

print("starting max dot")
start_time = time()
print(np.max(np.dot(vectors, target)))
print("done with max dot in %s seconds" % (time() - start_time))  # 0.0105748176575 seconds
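To illustrate why the largest dot product is not the same as the largest cosine similarity, here is a small sketch with made-up two-dimensional data (the names `toy_vectors` and `toy_target` are hypothetical and are not the arrays loaded above). A long vector can win on raw dot product even when a shorter vector points in exactly the target's direction.

```python
import numpy as np

# Made-up data: row 0 is long but not aligned with the target,
# row 1 is short but exactly parallel to the target.
toy_vectors = np.array([[10.0, 0.0], [1.0, 1.0]])
toy_target = np.array([1.0, 1.0])

# Raw dot products, one per row.
dots = toy_vectors @ toy_target

# Cosine similarity divides each dot product by the product of the norms.
cosines = dots / (np.linalg.norm(toy_vectors, axis=1) * np.linalg.norm(toy_target))

best_by_dot = int(np.argmax(dots))
best_by_cosine = int(np.argmax(cosines))
print(best_by_dot, best_by_cosine)  # prints "0 1"
```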

Is there a way to get the speedup that comes from letting numpy do the iteration without losing the max-index logic and the division by the product of the norms? For optimizing calculations like this, would it make sense to just do the calculations in C?

Recommended answer

You have the right idea: avoiding the Python loop is what gives you the performance. You can use argmin to get the index of the minimum distance.

However, I would also switch the distance calculation to scipy's cdist. That way you can compute distances to multiple targets at once and choose among several distance metrics, if need be.

import numpy as np
from scipy.spatial import distance

distances = distance.cdist([target], vectors, "cosine")[0]
min_index = np.argmin(distances)
min_distance = distances[min_index]
max_similarity = 1 - min_distance
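If you would rather stay in plain numpy, the loop from the question can also be vectorized directly while keeping both the division by the norms and the index of the maximum. The following is a minimal sketch, not a benchmarked implementation: the function name `most_cosine_similar` and the random demo data are hypothetical, and it assumes that no vector has zero norm.

```python
import numpy as np

def most_cosine_similar(vectors, target):
    """Return (index, similarity) of the row of `vectors` most cosine-similar to `target`.

    Assumes `vectors` has shape (n, d), `target` has shape (d,),
    and no row (nor the target) has zero norm.
    """
    vectors = np.asarray(vectors, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # One matrix-vector product replaces the per-row np.dot calls.
    dots = vectors @ target
    # Row-wise norms of the candidates times the norm of the target.
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(target)
    similarities = dots / norms
    best = int(np.argmax(similarities))
    return best, float(similarities[best])

# Random stand-in data with a shape similar to the question's arrays.
rng = np.random.default_rng(0)
demo_vectors = rng.standard_normal((1000, 312))
demo_target = rng.standard_normal(312)
index, similarity = most_cosine_similar(demo_vectors, demo_target)
print(index, similarity)
```

Since the target's norm is the same for every candidate, dividing by it does not change which index wins; it is kept here so the returned value is the actual cosine similarity.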

HTH.
