Scipy cosine similarity vs sklearn cosine similarity

Question

I noticed that both scipy and sklearn have cosine similarity/cosine distance functions. I wanted to test the speed of each on pairs of vectors:

setup1 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"
setup2 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"

import1 = "from sklearn.metrics.pairwise import cosine_similarity"
stmt1 = "[float(cosine_similarity(arr1.reshape(1,-1), arr2.reshape(1,-1))) for arr1, arr2 in zip(arrs1, arrs2)]"

import2 = "from scipy.spatial.distance import cosine"
stmt2 = "[float(1 - cosine(arr1, arr2)) for arr1, arr2 in zip(arrs1, arrs2)]"

import timeit
print("sklearn: ", timeit.timeit(stmt1, setup=import1 + ";" + setup1, number=1000))
print("scipy:   ", timeit.timeit(stmt2, setup=import2 + ";" + setup2, number=1000))

sklearn:  11.072769448000145
scipy:    1.9755544730005568

sklearn runs about 5.6 times slower than scipy in this run (11.07 s vs. 1.98 s), even if you remove the array reshape from the sklearn example and generate data that's already in the right shape. Why is one so much slower than the other?

Recommended answer

As mentioned in the comments, I don't think the comparison is fair, mainly because sklearn.metrics.pairwise.cosine_similarity is designed to compute the pairwise similarity between all samples in two 2-D input arrays. scipy.spatial.distance.cosine, on the other hand, is designed to compute the cosine distance between two 1-D arrays.
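To make the distinction concrete for the question's use case, where only one similarity per matching pair of vectors is wanted, here is a minimal NumPy-only sketch of a row-wise version. `paired_cosine_similarity` is a hypothetical helper name, not part of scipy or sklearn, and it assumes no row is all zeros:

```python
import numpy as np

def paired_cosine_similarity(a, b):
    """Cosine similarity between matching rows of a and b (hypothetical helper).

    a and b are 2-D arrays of the same shape; row i of a is compared
    only with row i of b. Assumes no row is all zeros.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Row-wise dot products: sum over the feature axis for each row pair.
    dots = np.einsum("ij,ij->i", a, b)
    # Product of the Euclidean norms of the matching rows.
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

# Same shapes as in the question: 60 pairs of 400-dimensional vectors.
rng = np.random.default_rng(0)
arrs1 = rng.random((60, 400))
arrs2 = rng.random((60, 400))
sims = paired_cosine_similarity(arrs1, arrs2)
```

This returns one value per pair from a single call, so the per-call overhead of either library function is paid once rather than 60 times.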

A fairer comparison may be scipy.spatial.distance.cdist vs. sklearn.metrics.pairwise.cosine_similarity, since both compute pairwise distances between the samples in the given arrays. To my surprise, though, this shows the sklearn implementation to be much faster than the scipy one (I don't currently have an explanation for that). Here is the experiment:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cdist

x = np.random.rand(1000,1000)
y = np.random.rand(1000,1000)

def sklearn_cosine():
    return cosine_similarity(x, y)

def scipy_cosine():
    return 1. - cdist(x, y, 'cosine')

# Make sure their result is the same.
assert np.allclose(sklearn_cosine(), scipy_cosine())

Here are the timing results:

%timeit sklearn_cosine()
10 loops, best of 3: 74 ms per loop

%timeit scipy_cosine()
1 loop, best of 3: 752 ms per loop
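
One possible contributor to the gap, offered as an assumption rather than something measured here: all-pairs cosine similarity can be written as a single matrix product of row-normalized inputs, which NumPy dispatches to an optimized BLAS routine, whereas a per-pair loop cannot benefit from that. The sketch below shows that formulation. `cosine_similarity_matmul` is a hypothetical name, and whether either library works this way internally is not confirmed by this answer:

```python
import numpy as np

def cosine_similarity_matmul(x, y):
    """All-pairs cosine similarity via one matrix product (hypothetical helper).

    x has shape (n, d), y has shape (m, d); the result has shape (n, m).
    Assumes no row of x or y is all zeros.
    """
    # Scale every row to unit Euclidean length.
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    # Dot products of unit vectors are exactly the cosine similarities.
    return x_unit @ y_unit.T

rng = np.random.default_rng(0)
x = rng.random((200, 50))
y = rng.random((300, 50))
sim = cosine_similarity_matmul(x, y)
```

Timing this function against the two library calls on the same `x` and `y` would be one way to check whether the matrix-product formulation accounts for the difference.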
