Cosine similarity for very large dataset


Question

I am having trouble calculating cosine similarity between a large list of 100-dimensional vectors. When I use from sklearn.metrics.pairwise import cosine_similarity, I get a MemoryError on my 16 GB machine. Each array fits perfectly in my memory, but I get the MemoryError during the internal np.dot() call.

Here's my use case and how I am currently tackling it.

Here's my parent vector of 100 dimensions, which I need to compare with 500,000 other vectors of the same dimension (i.e. 100):

parent_vector = [1, 2, 3, 4 ..., 100]

Here are my child vectors (with some made-up random numbers for this example):

child_vector_1 = [2, 3, 4, ....., 101]
child_vector_2 = [3, 4, 5, ....., 102]
child_vector_3 = [4, 5, 6, ....., 103]
.......
.......
child_vector_500000 = [3, 4, 5, ....., 103]

My final goal is to get the top-N child vectors (with their names, such as child_vector_1, and their corresponding cosine scores) that have the highest cosine similarity with the parent vector.

My current approach (which I know is inefficient and memory-consuming):

Step 1: Create a super-dataframe with the following shape (a sketch of building it follows the table):

parent_vector         1,    2,    3, .....,    100   
child_vector_1        2,    3,    4, .....,    101   
child_vector_2        3,    4,    5, .....,    102   
child_vector_3        4,    5,    6, .....,    103   
......................................   
child_vector_500000   3,    4,    5, .....,    103
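
For illustration, here is a minimal sketch of how such a dataframe could be assembled. The values are placeholders (random numbers standing in for the real child vectors); the index labels match the names used above:

import numpy as np
import pandas as pd

# Placeholder data: one parent row plus 500,000 child rows, dimension 100.
parent_vector = np.arange(1, 101)
child_vectors = np.random.rand(500000, 100)

# Label each row so the names survive into the similarity results.
index = ['parent_vector'] + [f'child_vector_{i}' for i in range(1, 500001)]
df = pd.DataFrame(np.vstack([parent_vector, child_vectors]), index=index)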

Step 2: Use

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)

to get the pairwise cosine similarity between all vectors (shown in the dataframe above).

Step 3: Make a list of tuples storing the key (such as child_vector_1) and the value (the cosine similarity number) for all such combinations.

Step 4: Get the top-N using the list's sort() -- so that I get the child vector names as well as their cosine similarity scores with the parent vector. (Steps 2-4 are sketched together below.)
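
For concreteness, a minimal sketch of steps 2-4 as described, assuming the labeled df from step 1 with the parent stored as row 0 (sims, pairs, and top_n are illustrative names; the cosine_similarity(df) call is exactly the part that exhausts memory):

from sklearn.metrics.pairwise import cosine_similarity

# Step 2: full pairwise similarity matrix -- the memory-hungry part.
sims = cosine_similarity(df)

# Step 3: pair each child's name with its similarity to the parent (row 0).
pairs = list(zip(df.index[1:], sims[0, 1:]))

# Step 4: sort by score, descending, and keep the top N.
pairs.sort(key=lambda p: p[1], reverse=True)
top_n = pairs[:10]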

PS: I know this is highly inefficient, but I couldn't think of a faster way to compute the cosine similarity between each child vector and the parent vector and get the top-N values.

Any help would be appreciated.

Answer

Even though your (500000, 100) array (the parent and its children) fits into memory, any pairwise metric on it won't. The reason is that a pairwise metric, as the name suggests, computes the distance between any two children. To store these distances you would need a (500000, 500000) array of floats, which works out to roughly 2 TB of memory for 64-bit floats (500,000² entries at 8 bytes each) -- far beyond a 16 GB machine.
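
As a quick sanity check on that figure (a back-of-the-envelope sketch, assuming 64-bit floats, which is what you get with float64 input):

# 500,000 x 500,000 pairwise entries, 8 bytes per float64
entries = 500_000 ** 2           # 2.5e11 entries
print(entries * 8 / 1e12, 'TB')  # ~2.0 TB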

Thankfully, there is an easy solution to your problem: if I understand you correctly, you only want the distances between the children and the parent, which results in a vector of length 500,000 that is easily stored in memory.

To do this, you simply need to provide a second argument to cosine_similarity containing only the parent_vector:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(np.random.rand(500000, 100))

# Here I assume the parent vector is stored as the first row of the
# dataframe, but you could also store it separately. cosine_similarity
# returns a (500000, 1) array, so ravel() flattens it into a 1-D column.
df['distances'] = cosine_similarity(df, df.iloc[0:1]).ravel()

n = 10  # or however many you want

# This contains the parent itself as the most similar entry,
# hence n + 1 to get n children.
n_largest = df['distances'].nlargest(n + 1)
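
As an aside (not part of the approach above): if you want to avoid the dataframe overhead, here is a plain-NumPy sketch of the same top-N query, normalizing once and using np.argpartition so only the top n entries are fully sorted. children and parent are stand-ins for the real data:

import numpy as np

children = np.random.rand(500000, 100)  # stand-in for the real child vectors
parent = np.random.rand(100)            # stand-in for the real parent vector

# Cosine similarity is the dot product of L2-normalized vectors.
children_unit = children / np.linalg.norm(children, axis=1, keepdims=True)
parent_unit = parent / np.linalg.norm(parent)
scores = children_unit @ parent_unit    # shape (500000,)

n = 10
top = np.argpartition(scores, -n)[-n:]      # indices of the top n, unordered
top = top[np.argsort(scores[top])[::-1]]    # order them by descending score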

Hope that solves your problem.
