cosine similarity on large sparse matrix with numpy


Question

The code below causes my system to run out of memory before it completes.

Can you suggest a more efficient means of computing the cosine similarity on a large matrix, such as the one below?

I would like to have the cosine similarity computed for each of the 65000 rows in my original matrix (mat) relative to all of the others so that the result is a 65000 x 65000 matrix where each element is the cosine similarity between two rows in the original matrix.
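For reference, each element of the target matrix is just the cosine similarity of two rows; a minimal sketch of that formula:

```python
import numpy as np

def cosine_sim(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
print(round(cosine_sim(u, v), 4))  # 0.7071
```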

import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

mat = np.random.rand(65000, 10)

sparse_mat = sparse.csr_matrix(mat)

similarities = cosine_similarity(sparse_mat)

After running that last line I always run out of memory and the program either freezes or crashes with a MemoryError. This occurs whether I run on my 8 GB of local RAM or on a 64 GB EC2 instance.
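For scale, the full 65000 x 65000 float64 result alone takes roughly 31.5 GiB before scikit-learn allocates any temporaries, which already exceeds the 8 GB machine; a quick back-of-the-envelope check:

```python
# Memory needed just to hold the 65000 x 65000 float64 similarity matrix.
n_rows = 65000
total_bytes = n_rows * n_rows * 8  # 8 bytes per float64 element
print(f"{total_bytes / 2**30:.1f} GiB")  # 31.5 GiB for the result alone
```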

Answer

Same problem here. I've got a big, non-sparse matrix. It fits in memory just fine, but cosine_similarity crashes for whatever unknown reason, probably because they copy the matrix one time too many somewhere. So I made it compare small batches of rows "on the left" instead of the entire matrix:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_n_space(m1, m2, batch_size=100):
    assert m1.shape[1] == m2.shape[1]
    # Preallocate the full result; each batch fills in a slab of rows.
    ret = np.empty((m1.shape[0], m2.shape[0]))
    for start in range(0, m1.shape[0], batch_size):
        end = min(start + batch_size, m1.shape[0])
        rows = m1[start:end]
        # Each call only compares batch_size rows against all of m2,
        # so sklearn's internal temporaries stay small.
        ret[start:end] = cosine_similarity(rows, m2)
    return ret

No crashes for me; YMMV. Try different batch sizes to make it faster. I used to only compare 1 row at a time, and it took about 30X as long on my machine.

Stupid yet effective sanity check:

import random
while True:  # runs until interrupted, or until an assertion fails
    m = np.random.rand(random.randint(1, 100), random.randint(1, 100))
    n = np.random.rand(random.randint(1, 100), m.shape[1])
    assert np.allclose(cosine_similarity(m, n), cosine_similarity_n_space(m, n))
