Python - How to speed up cosine similarity with counting arrays

Problem description

I need to compute cosine similarity across a very large set of users. Each user is represented as an array of object IDs. For example:

user_1 = [1,4,6,100,3,1]
user_2 = [4,7,8,3,3,2,200,9,100]

If my understanding is correct, to compute the cosine similarity I first need to build counting arrays so that every user has the same representation, and then apply the cosine similarity function. By counting arrays I mean the following:

#user_1 array
#                        1,2,3,4,5,6,[7-99],100,[101-200]
user_1_counting_array = [2,0,1,1,0,1,.......,1,.........]
user_2_counting_array = [0,1,2,1,0,0,1,1,1,.,1,.......,1]

(The dots represent zeros in this case.)

Once I have this common representation, I compute the cosine similarity with SciPy's cosine distance function:

from scipy import spatial
s = 1 - spatial.distance.cosine(user_1_counting_array, user_2_counting_array)
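
For reference, the quantity computed above is the standard cosine similarity. A worked version for the two example users follows, writing u and v for the two counting arrays (these symbols are introduced here for illustration only). Only the IDs 3, 4 and 100 appear for both users, so only they contribute to the dot product:

```latex
\[
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
\]
\[
u \cdot v = 1\cdot 2 + 1\cdot 1 + 1\cdot 1 = 4, \quad
\lVert u \rVert = \sqrt{8}, \quad
\lVert v \rVert = \sqrt{11}, \quad
\cos(u, v) = \frac{4}{\sqrt{88}} \approx 0.426
\]
```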

The problem is that when I actually run the code, everything is extremely slow, and I have more than 1M users. I understand that the number of pairs to compare is very large, but I think the way I create the common representation is a major bottleneck.

For completeness, here is my implementation:

from collections import Counter
from scipy import spatial

def fill_array(array, counter):
    for c in counter:
        array[c] = counter[c]
    return array

user_1 = [1,4,6,100,3,1]
user_2 = [4,7,8,3,3,2,200,9,100]

user_1_c = Counter(user_1)
user_2_c = Counter(user_2)

# Arrays must be long enough to index the largest object ID of either user
max_a = max(max(user_1_c), max(user_2_c)) + 1

user_1_c_array = [0]*max_a
user_2_c_array = [0]*max_a

fill_array(user_1_c_array, user_1_c)
fill_array(user_2_c_array, user_2_c)

result = 1 - spatial.distance.cosine(user_1_c_array, user_2_c_array)

Recommended answer

Here is how you can get short, compact vectors for the cosine similarity without looping over a million entries:

from collections import Counter

user_1 = [1,4,6,100,3,1]
user_2 = [4,7,8,3,3,2,200,9,100]

# Create a list of unique elements
uniq = list(set(user_1 + user_2))

# Map every unique entry in user_1 and user_2 to a zero count
duniq = {k:0 for k in uniq}

def create_vector(duniq, l):
    dx = duniq.copy()
    dx.update(Counter(l)) # Count the values
    return list(dx.values()) # Return a list

u1 = create_vector(duniq, user_1)
u2 = create_vector(duniq, user_2)

# Resulting u1, u2 (element order follows the set's iteration order):

u1 = [2, 0, 1, 1, 1, 0, 0, 0, 0, 1]
u2 = [0, 1, 2, 1, 0, 1, 1, 1, 1, 1]

You can then feed these two vectors into spatial.distance.cosine and subtract the result from 1 to get the similarity, as in the question.
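
As a possible further step, the dense vectors can be skipped entirely by treating each Counter as a sparse vector. The sketch below is not part of the original answer: the function name cosine_similarity and the choice to return 0.0 for an empty user are assumptions made for illustration, and it uses only the standard library.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity of two Counters treated as sparse count vectors."""
    # Iterate over the smaller Counter; only shared IDs add to the dot product.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    dot = sum(n * counts_b[k] for k, n in counts_a.items() if k in counts_b)
    norm_a = sqrt(sum(n * n for n in counts_a.values()))
    norm_b = sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty user is similar to nobody
    return dot / (norm_a * norm_b)

user_1 = [1,4,6,100,3,1]
user_2 = [4,7,8,3,3,2,200,9,100]

similarity = cosine_similarity(Counter(user_1), Counter(user_2))
print(similarity)
```

The work per pair then depends on how many distinct IDs the two users have rather than on the largest ID value. With many users, each user's Counter and norm could be computed once and reused across pairs.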
