Python - How to speed up cosine similarity with counting arrays
Question
I need to compute cosine similarity across a very large set of users. Each user is represented as an array of object IDs. For example:
user_1 = [1,4,6,100,3,1]
user_2 = [4,7,8,3,3,2,200,9,100]
If my understanding is correct, to compute the cosine similarity I first need to build counting arrays so that every user has the same representation, and then apply the cosine similarity function. By counting arrays I mean the following:
#user_1 array
# 1,2,3,4,5,6,[7-99],100,[101-200]
user_1_counting_array = [2,0,1,1,0,1,.......,1,.........]
user_2_counting_array = [0,1,2,1,0,0,1,1,1,.,1,.......,1]
(The dots represent zeros in this case.)
Once I have this common representation, I compute the cosine similarity using the cosine distance function from SciPy:
from scipy import spatial
s = 1 - spatial.distance.cosine(user_1_counting_array, user_2_counting_array)
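For reference, `spatial.distance.cosine` returns the cosine distance, so subtracting it from 1 gives the similarity. As a sanity check, here is the value worked out by hand for the two example users, who share only the IDs 3, 4 and 100:

```latex
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
           = \frac{1 \cdot 2 + 1 \cdot 1 + 1 \cdot 1}{\sqrt{2^2 + 1^2 + 1^2 + 1^2 + 1^2} \; \sqrt{7 \cdot 1^2 + 2^2}}
           = \frac{4}{\sqrt{8}\,\sqrt{11}} \approx 0.426
```

Only IDs present in both users contribute to the numerator, so the zero entries of the counting arrays never affect the result.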
The problem is that when I actually run the code, everything is extremely slow, and I have more than 1M users. I understand that the number of user pairs will be large, but I think the way I build the common representation is a major bottleneck.
For completeness, here is my implementation:
from collections import Counter
from scipy import spatial
def fill_array(array, counter):
    for c in counter:
        array[c] = counter[c]
    return array
user_1 = [1,4,6,100,3,1]
user_2 = [4,7,8,3,3,2,200,9,100]
user_1_c = Counter(user_1)
user_2_c = Counter(user_2)
if max(user_1_c) > max(user_2_c):
    max_a = max(user_1_c)+1
else:
    max_a = max(user_2_c)+1
user_1_c_array = [0]*max_a
user_2_c_array = [0]*max_a
fill_array(user_1_c_array, user_1_c)
fill_array(user_2_c_array, user_2_c)
result = 1 - spatial.distance.cosine(user_1_c_array, user_2_c_array)
Recommended answer
Here's how you can build short, compact count vectors for the cosine similarity without looping over a million entries:
from collections import Counter

user_1 = [1,4,6,100,3,1]
user_2 = [4,7,8,3,3,2,200,9,100]
# Create a list of unique elements
uniq = list(set(user_1 + user_2))
# Map every unique entry in user_1 and user_2 to a zero count
duniq = {k:0 for k in uniq}
def create_vector(duniq, l):
    dx = duniq.copy()
    dx.update(Counter(l)) # Overwrite the zeros with the actual counts
    return list(dx.values()) # Counts in the same key order as duniq
u1 = create_vector(duniq, user_1)
u2 = create_vector(duniq, user_2)
# u1, u2:
u1 = [2, 0, 1, 1, 1, 0, 0, 0, 0, 1]
u2 = [0, 1, 2, 1, 0, 1, 1, 1, 1, 1]
You can then feed these two vectors into spatial.distance.cosine.
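Going one step further, the dense vectors can be skipped altogether, since only IDs that appear in both users contribute to the dot product. The following is a minimal sketch of that idea using only the standard library. It is not part of the original answer, and the helper name `cosine_from_counts` is hypothetical. It also assumes that a user with no items should get a similarity of 0.

```python
import math
from collections import Counter


def cosine_from_counts(a, b):
    """Cosine similarity of two Counters treated as sparse count vectors."""
    # Iterate over the smaller Counter so the loop is as short as possible.
    if len(a) > len(b):
        a, b = b, a
    # Only keys present in both Counters contribute to the dot product.
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty user is similar to nobody
    return dot / (norm_a * norm_b)


user_1 = [1, 4, 6, 100, 3, 1]
user_2 = [4, 7, 8, 3, 3, 2, 200, 9, 100]
similarity = cosine_from_counts(Counter(user_1), Counter(user_2))
print(round(similarity, 4))  # 0.4264
```

The cost per pair depends on how many distinct IDs each user has rather than on the largest ID value. With many users, it would probably also pay off to build each Counter and its norm once and reuse them across pairs.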