Cosine similarity between two ndarrays


Question

I have two numpy arrays: the first has shape 100*4*200 and the second has shape 150*6*200. Array 1 stores 100 samples, each with 4 fields represented as 200-dimensional vectors, and array 2 stores 150 samples, each with 6 fields represented as 200-dimensional vectors.

Now I want to compute a similarity vector for each pair of samples and collect them into a similarity matrix. For each pair of samples, I would like to calculate the similarity between every combination of fields and store it, so that I end up with a 15000*24 array.

The first 150 rows will be the similarity vectors between the 1st row of array 1 and the 150 rows of array 2, the next 150 rows will be the similarity vectors between the 2nd row of array 1 and the 150 rows of array 2, and so on. Each similarity vector has (# fields in array 1) * (# fields in array 2) elements: the 1st element is the cosine similarity between field 1 of array 1 and field 1 of array 2, the 2nd element is the similarity between field 1 of array 1 and field 2 of array 2, and so on, with the last element being the similarity between the last field of array 1 and the last field of array 2.
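To make the requested layout concrete, here is a small sketch of how a sample pair and a field pair map to a position in the 15000*24 output. The helper name `locate` and the 0-based indices are my own assumptions for illustration; they are not part of the question.

```python
# Sizes taken from the question
N_A, N_B = 100, 150   # samples in array 1 and array 2
F_A, F_B = 4, 6       # fields per sample in array 1 and array 2

def locate(i, j, p, q):
    """Return (row, col) in the output for sample i of array 1, sample j of
    array 2, field p of array 1 and field q of array 2 (all 0-based)."""
    row = i * N_B + j   # blocks of 150 rows, one block per sample of array 1
    col = p * F_B + q   # field p of array 1 against fields 0..5 of array 2
    return row, col

# 2nd sample of array 1 vs 1st sample of array 2, field 1 vs field 2
row, col = locate(1, 0, 0, 1)
```

With this ordering the last entry, `locate(99, 149, 3, 5)`, lands at row 14999 and column 23.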

What is the best way to do this using numpy arrays?

Recommended answer

So every "row" (I assume you mean the first axis, which I'll call axis 0) is the sample axis. That means the first array holds 100 samples, each with fields x dimensions of 4 x 200.

Doing this the way you describe, the first row of the first array would have shape (4, 200) and the second array would have shape (150, 6, 200). You would then be asking for a cosine similarity between a 2-D array and a 3-D array, which does not make sense (the closest thing to a dot product here would be the tensor product, which I'm fairly sure is not what you want).

So we have to extract the individual samples first and then iterate over all the samples of the other array.

To do this I actually recommend just splitting the arrays with np.split and iterating over both of them. This is only because I've never come across a faster way in numpy. You could use tensorflow to gain efficiency, but I'm not going into that here in my answer.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.random.rand(100, 4, 200)
b = np.random.rand(150, 6, 200)
# The output has 100*150 rows (sample pairs) and 4*6 columns (field pairs)
c = np.empty([15000, 24])

# Split each array along axis 0 into a list of single-sample arrays
a_splitted = np.split(a, a.shape[0], 0)
b_splitted = np.split(b, b.shape[0], 0)
i = 0
for alpha in a_splitted:
    for beta in b_splitted:
        # alpha[0] is (4, 200) and beta[0] is (6, 200); this gives a 4x6 matrix
        sim = cosine_similarity(alpha[0], beta[0])
        # Flatten row-major: field 1 vs fields 1..6, then field 2 vs fields 1..6, ...
        c[i, :] = sim.ravel()
        i += 1

For the similarity function above I just chose what @StefanFalk suggested: sklearn.metrics.pairwise.cosine_similarity. If this similarity measure is not sufficient, you could write your own.
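If you do want to write your own, a minimal numpy-only sketch could look like the following. The function name `cosine_sim_matrix` and the `eps` guard against all-zero rows are my own choices for illustration; this is not sklearn's implementation.

```python
import numpy as np

def cosine_sim_matrix(x, y, eps=1e-12):
    """Pairwise cosine similarity between the rows of x (p, d) and y (q, d).

    Returns a (p, q) matrix. eps is an assumed guard so that all-zero rows
    do not cause a division by zero.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Scale every row to unit length
    x_unit = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)
    y_unit = y / np.maximum(np.linalg.norm(y, axis=1, keepdims=True), eps)
    # Dot products of unit vectors are cosine similarities
    return x_unit @ y_unit.T

sim = cosine_sim_matrix([[1, 0], [0, 1]], [[1, 0], [1, 1]])
```

Called with a (4, 200) and a (6, 200) array, it returns a 4x6 matrix, so it should be usable in place of cosine_similarity in a loop like the one above.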

I am not at all claiming that this is the best way to do this in all of Python. I think the most efficient way is to do this symbolically using, as mentioned, tensorflow.
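For what it's worth, the same 15000*24 result can probably also be produced without a Python loop, by normalizing the vectors once and letting np.einsum form all the dot products. This is a sketch of that alternative, not something I benchmarked; the function name `all_pairs_cosine` and the `eps` guard are assumptions for illustration.

```python
import numpy as np

def all_pairs_cosine(a, b, eps=1e-12):
    """a has shape (n_a, f_a, d), b has shape (n_b, f_b, d).

    Returns an (n_a * n_b, f_a * f_b) array laid out like the loop version:
    row = i * n_b + j, column = p * f_b + q.
    """
    # Normalize every field vector to unit length along the last axis
    a_unit = a / np.maximum(np.linalg.norm(a, axis=-1, keepdims=True), eps)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=-1, keepdims=True), eps)
    # sims[i, j, p, q] = cosine similarity of field p of a[i] and field q of b[j]
    sims = np.einsum('ipd,jqd->ijpq', a_unit, b_unit)
    return sims.reshape(a.shape[0] * b.shape[0], a.shape[1] * b.shape[1])

rng = np.random.default_rng(0)
a = rng.random((100, 4, 200))
b = rng.random((150, 6, 200))
c = all_pairs_cosine(a, b)   # shape (15000, 24)
```

The intermediate 4-D array holds 100*150*4*6 values, which is small here; for much larger inputs the memory cost of that intermediate would need to be checked.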

Anyway, hope it helps!
