Scipy cosine similarity vs sklearn cosine similarity
Question
I noticed that both scipy and sklearn have cosine similarity/cosine distance functions. I wanted to test the speed of each on pairs of vectors:
setup1 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"
setup2 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"
import1 = "from sklearn.metrics.pairwise import cosine_similarity"
stmt1 = "[float(cosine_similarity(arr1.reshape(1,-1), arr2.reshape(1,-1))) for arr1, arr2 in zip(arrs1, arrs2)]"
import2 = "from scipy.spatial.distance import cosine"
stmt2 = "[float(1 - cosine(arr1, arr2)) for arr1, arr2 in zip(arrs1, arrs2)]"
import timeit
print("sklearn: ", timeit.timeit(stmt1, setup=import1 + ";" + setup1, number=1000))
print("scipy: ", timeit.timeit(stmt2, setup=import2 + ";" + setup2, number=1000))
sklearn: 11.072769448000145
scipy: 1.9755544730005568
sklearn runs several times slower than scipy (about 5.6 times in the timings above), even if you remove the array reshape for the sklearn example and generate data that is already in the right shape. Why is one so much slower than the other?
Recommended answer
As mentioned in the comments, I don't think the comparison is fair, mainly because sklearn.metrics.pairwise.cosine_similarity is designed to compute the pairwise similarity between all samples in the given 2-D input arrays, whereas scipy.spatial.distance.cosine is designed to compute the cosine distance between two 1-D arrays.
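The difference in input and output shapes can be seen in a minimal sketch. The variable names and the fixed random seed are illustrative assumptions, not part of the original benchmark:

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)  # fixed seed, chosen only for this sketch
a = rng.random(400)
b = rng.random(400)

# scipy: two 1-D vectors in, one scalar cosine *distance* out.
scipy_sim = 1.0 - cosine(a, b)

# sklearn: 2-D arrays of shape (n_samples, n_features) in,
# a full (n_samples_X, n_samples_Y) *similarity* matrix out.
sim_matrix = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))
sklearn_sim = sim_matrix[0, 0]

print(sim_matrix.shape)
print(np.isclose(scipy_sim, sklearn_sim))
```

With a single pair, the sklearn call still builds and returns a matrix, here of shape (1, 1), so the benchmark in the question invokes a matrix-oriented function once per pair.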
A fairer comparison may be scipy.spatial.distance.cdist vs. sklearn.metrics.pairwise.cosine_similarity, since both compute pairwise distances between the samples in the given arrays. To my surprise, however, this shows the sklearn implementation to be much faster than the scipy one (I don't currently have an explanation for that). Here is the experiment:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cdist
x = np.random.rand(1000,1000)
y = np.random.rand(1000,1000)
def sklearn_cosine():
    return cosine_similarity(x, y)
def scipy_cosine():
    return 1. - cdist(x, y, 'cosine')
# Make sure their result is the same.
assert np.allclose(sklearn_cosine(), scipy_cosine())
Here are the timing results:
%timeit sklearn_cosine()
10 loops, best of 3: 74 ms per loop
%timeit scipy_cosine()
1 loop, best of 3: 752 ms per loop
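For the use case in the question (comparing arrs1[i] with arrs2[i] only, not every row against every row), one option is to stack the vectors and compute all paired similarities in a single vectorized NumPy pass. This is a sketch under assumptions, not something taken from the answer: paired_cosine_similarity is a hypothetical helper name, and no speedup was measured here.

```python
import numpy as np
from scipy.spatial.distance import cosine

rng = np.random.default_rng(0)  # fixed seed, chosen only for this sketch
arrs1 = [rng.random(400) for _ in range(60)]
arrs2 = [rng.random(400) for _ in range(60)]

def paired_cosine_similarity(vectors_a, vectors_b):
    """Hypothetical helper: similarity of vectors_a[i] with vectors_b[i]."""
    a = np.asarray(vectors_a, dtype=float)  # shape (n_pairs, n_features)
    b = np.asarray(vectors_b, dtype=float)
    # Row-wise dot products, without building the full pairwise matrix.
    dots = np.einsum("ij,ij->i", a, b)
    # Product of row norms; an all-zero row would cause a division by zero.
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

batched = paired_cosine_similarity(arrs1, arrs2)

# Cross-check against the per-pair scipy loop from the question.
looped = np.array([1.0 - cosine(u, v) for u, v in zip(arrs1, arrs2)])
print(np.allclose(batched, looped))
```

Both approaches should agree up to floating-point error; to see whether batching helps on a given machine, time it the same way as the experiments above.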