Postgres: index on cosine similarity of float arrays for one-to-many search

Question

Cosine similarity between two equally-sized vectors (of reals) is defined as the dot product divided by the product of the norms.
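As a minimal sketch of that definition, here is a plain-Python version. The name cos_sim simply mirrors the placeholder function in the query in this question; it is not an existing Postgres function, and rejecting zero vectors is an assumption made here because the ratio is undefined for them.

```python
import math


def cos_sim(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption of this sketch: treat a zero vector as an error.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score close to 1, orthogonal vectors score 0, and opposite vectors score -1, regardless of their magnitudes.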

To represent vectors, I have a large table of float arrays, e.g. CREATE TABLE foo(vec float[]). Given a certain float array, I need to quickly (with an index, not a seqscan) find the closest arrays in that table by cosine similarity, e.g. SELECT * FROM foo ORDER BY cos_sim(vec, ARRAY[1.0, 4.5, 2.2]) DESC LIMIT 10; But what do I use?

pg_trgm's cosine similarity support is different: it compares text, and I'm not sure exactly what it does. An extension called smlar also supports cosine similarity for float arrays, but again it does something different. What I described is commonly used in data analysis to compare features of documents, so I expected Postgres to support it.

Recommended answer

I gather that no extension does this, so I've found a limited workaround:

If A and B are both normalized (length 1), then cos(A, B) = 1 - 0.5 * ||A - B||^2, where ||A - B|| is the Euclidean distance and cos(A, B) is the cosine similarity. So greater Euclidean distance <=> lesser cosine similarity (this makes sense intuitively if you imagine a unit circle). And if your vectors aren't normalized, changing their magnitudes without changing their directions doesn't affect their cosine similarities. Great, so I can normalize my vectors and compare their Euclidean distances...
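For completeness, the identity follows from expanding the squared distance. This is the standard derivation and uses only the assumption already stated, that both vectors have length 1:

```latex
\begin{aligned}
\|A - B\|^2 &= (A - B)\cdot(A - B) \\
            &= \|A\|^2 + \|B\|^2 - 2\,A\cdot B \\
            &= 2 - 2\cos(A, B) \qquad \text{since } \|A\| = \|B\| = 1 \\
\Rightarrow\quad \cos(A, B) &= 1 - \tfrac{1}{2}\,\|A - B\|^2
\end{aligned}
```

Because the right-hand side strictly decreases as the distance grows, sorting by distance ascending gives the same order as sorting by cosine similarity descending.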

There's a nice answer here about cube, which supports n-dimensional points and GiST indexes on Euclidean distance, but it supports only 100 or fewer dimensions (the limit can be hacked higher, but I had issues at around 135 and higher, so now I'm wary). It also requires Postgres 9.6 or later.

So:

  1. Make sure I'm fine with having at most 100 dimensions. Upgrade to Postgres 9.6 or later.
  2. Fill my table with arrays to represent vectors.
  3. Normalize the vectors to create an extra column of cube points. Create a GiST index on this column.
  4. Order by Euclidean distance ascending to get cosine similarity descending: EXPLAIN SELECT * FROM mytable ORDER BY normalized <-> cube(array[1,2,3,4,5,6,7,8,9,0]) LIMIT 10;
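The ordering claim in step 4 can be sanity-checked outside the database. The following is a brute-force sketch in plain Python, not the indexed Postgres query: it ranks the same rows once by cosine similarity descending and once by Euclidean distance between normalized vectors ascending. The helper names, the random test data, and the fixed seed are all assumptions made for illustration.

```python
import math
import random


def normalize(v):
    """Scale v to length 1 (the value that would go in the cube column)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cos_sim(a, b):
    """Cosine similarity of two equally sized, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean(a, b):
    """Euclidean distance, the quantity the <-> operator orders by."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_by_cosine(rows, query, k):
    """Row indices with the highest cosine similarity to query."""
    order = sorted(range(len(rows)),
                   key=lambda i: cos_sim(rows[i], query), reverse=True)
    return order[:k]


def top_k_by_normalized_distance(rows, query, k):
    """Row indices whose normalized vectors are nearest to the normalized query."""
    q = normalize(query)
    normalized = [normalize(r) for r in rows]
    order = sorted(range(len(rows)), key=lambda i: euclidean(normalized[i], q))
    return order[:k]


# Hypothetical data: 200 random 10-dimensional rows and the query from step 4.
rng = random.Random(0)
rows = [[rng.uniform(-1.0, 1.0) for _ in range(10)] for _ in range(200)]
query = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]

same = top_k_by_cosine(rows, query, 10) == top_k_by_normalized_distance(rows, query, 10)
```

With stored vectors of length 1, the squared distance to any fixed query q is 1 + ||q||^2 - 2 * ||q|| * cos, so the ordering holds even when the query itself is not normalized, as in the query in step 4.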

If I need more than 100 dimensions, I might be able to achieve this using multiple indexed columns. Will update the answer in that case.

Update: I'm pretty sure that splitting a >100-dimension vector into multiple columns doesn't help. I end up having to scan the entire table.
