索引和查询postgreSQL中的高维数据 [英] indexing and query high dimensional data in postgreSQL

查看:140
本文介绍了索引和查询postgreSQL中的高维数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想在高度尺寸中索引数据(可以使用[0,254]范围内的整数的128维向量):

I want to index data in height dimensions (128 dimensional vectors of integers in range of [0,254] are possible):

| id |      vector       |
|  1 | { 1, 0, ..., 254} |
|  2 | { 2, 128, ...,1}  |
|  . | { 1, 0, ..., 252} |
|  n | { 1, 2, ..., 251} |

我看到PostGIS实现了R-Trees。那么我可以在PostGIS中使用这些树来索引和查询Postgres中的多维向量吗?

I saw that PostGIS implemented R-Trees. So can I use these trees in PostGIS to index and query multidimensional vectors in Postgres?

我还看到有一个 int数组的索引实现

现在我有疑问关于如何执行查询。

我可以对整数数组执行knn-search和radius搜索吗?
也许我还必须定义自己的距离函数。这可能吗?我想使用曼哈顿距离(区块距离)进行查询。

Now I have questions about how to perform a query.
Can I perform a knn-search and a radius search on an integer array? Maybe I also must define my own distance function. Is this possible? I want to use the Manhattan distance (block distance) for my queries.

我也可以将我的向量表示为二进制字符串,格式为 v1; v2; ...; vn 。这有助于执行搜索吗?

I also can represent my vector as a binary string with the pattern v1;v2;...;vn. Does this help to perform the search?

例如,如果我有这两个字符串:

For example if I had these two string:

1;2;1;1
1;3;2;2

这两个字符串之间的结果/距离应为3.

The result / distance between these two strings should be 3.

推荐答案

或许更好的选择是立方体扩展,因为您感兴趣的区域不是单个整数,而是全向量。

Perhaps a better choice would be the cube extension, since your area of interest is not individual integer, but full vector.

Cube支持GiST索引,Postgres 9.6也会将KNN索引带到多维数据集,支持 euclidean,taxicab(又名曼哈顿)和chebishev距离

Cube supports GiST indexing, and Postgres 9.6 will also bring KNN indexing to cubes, supporting euclidean, taxicab (aka Manhattan) and chebishev distances.

它是有点烦人的9.6仍然在开发中,但是没有问题将多维数据集扩展的补丁向后移植到9.5并且我从经验中说出来。

It is a bit annoying that 9.6 is still in development, however there's no problem backporting patch for cube extension to 9.5 and I say that from experience.

希望128 dimensio ns仍然足以获得有意义的结果

Hopefully 128 dimensions will still be enough to get meaningful results.

怎么做?

首先有一个示例表:

create extension cube;
create table vectors (id serial, vector cube);

使用示例数据填充表格:

Populate table with example data:

insert into vectors select id, cube(ARRAY[round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000)]) from generate_series(1, 2000000) id;

然后尝试选择:

explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=123352.07..123352.09 rows=10 width=76) (actual time=1705.499..1705.501 rows=10 loops=1)
   ->  Sort  (cost=123352.07..129852.07 rows=2600000 width=76) (actual time=1705.496..1705.497 rows=10 loops=1)
         Sort Key: (('(966, 82, 765, 343, 600, 718, 338, 505)'::cube <#> vector))
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Seq Scan on vectors  (cost=0.00..67167.00 rows=2600000 width=76) (actual time=0.038..998.864 rows=2600000 loops=1)
 Planning time: 0.172 ms
 Execution time: 1705.541 ms
(7 rows)

我们应该创建一个索引:

We should create an index:

create index vectors_vector_idx on vectors (vector);

是否有帮助:

explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;

--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.41..1.93 rows=10 width=76) (actual time=41.339..143.915 rows=10 loops=1)
   ->  Index Scan using vectors_vector_idx on vectors  (cost=0.41..393704.41 rows=2600000 width=76) (actual time=41.336..143.902 rows=10 loops=1)
         Order By: (vector <#> '(966, 82, 765, 343, 600, 718, 338, 505)'::cube)
 Planning time: 0.146 ms
 Execution time: 145.474 ms
(5 rows)

8维,确实有帮助。

这篇关于索引和查询postgreSQL中的高维数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆