给定稀疏矩阵数据,Python中最快的计算余弦相似度的方法是什么? [英] What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

查看:260
本文介绍了给定稀疏矩阵数据,Python中最快的计算余弦相似度的方法是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出一个稀疏矩阵列表,计算矩阵中每一列(或每一行)之间的余弦相似度的最佳方法是什么?我宁愿不重复两次选择.

Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.

假设输入矩阵为:

A= 
[0 1 0 0 1
 0 0 1 1 1
 1 1 0 1 0]

稀疏表示为:

A = 
0, 1
0, 4
1, 2
1, 3
1, 4
2, 0
2, 1
2, 3

在Python中,使用矩阵输入格式很简单:

In Python, it's straightforward to work with the matrix-input format:

import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine

A = np.array(
[[0, 1, 0, 0, 1],
[0, 0, 1, 1, 1],
[1, 1, 0, 1, 0]])

dist_out = 1-pairwise_distances(A, metric="cosine")
dist_out

赠予:

array([[ 1.        ,  0.40824829,  0.40824829],
       [ 0.40824829,  1.        ,  0.33333333],
       [ 0.40824829,  0.33333333,  1.        ]])

对于全矩阵输入来说很好,但是我真的想从稀疏表示开始(由于矩阵的大小和稀疏性).关于如何最好地实现这一点的任何想法?预先感谢.

That's fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished? Thanks in advance.

推荐答案

您可以直接使用sklearn在稀疏矩阵的行上计算成对的余弦相似度.从0.17版开始,它还支持稀疏输出:

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:

from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

A =  np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1],[1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

#also can output sparse matrices
similarities_sparse = cosine_similarity(A_sparse,dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

结果:

pairwise dense output:
[[ 1.          0.40824829  0.40824829]
[ 0.40824829  1.          0.33333333]
[ 0.40824829  0.33333333  1.        ]]

pairwise sparse output:
(0, 1)  0.408248290464
(0, 2)  0.408248290464
(0, 0)  1.0
(1, 0)  0.408248290464
(1, 2)  0.333333333333
(1, 1)  1.0
(2, 1)  0.333333333333
(2, 0)  0.408248290464
(2, 2)  1.0

如果要按列进行余弦相似度,只需预先转置输入矩阵即可:

If you want column-wise cosine similarities simply transpose your input matrix beforehand:

A_sparse.transpose()

这篇关于给定稀疏矩阵数据,Python中最快的计算余弦相似度的方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆