Postgresql k-nearest neighbor (KNN) on multidimensional cube
Question
I have a cube with 8 dimensions and I want to do nearest-neighbor matching. I'm totally new to PostgreSQL. I read that 9.1 supports nearest-neighbor matching on multiple dimensions. I'd really appreciate it if someone could give a complete example:
- How to create a table with the 8D cube?
- Sample insert
- Lookup: exact match
- Lookup: nearest-neighbor match
Sample data:
For simplicity's sake, we can assume that all values range from 0 to 100.
Point1: (1,1,1,1, 1,1,1,1)
Point2: (2,2,2,2, 2,2,2,2)
Lookup value: (1,1,1,1, 1,1,1,2)
This should match Point1 and not Point2.
Reference:
https://en.wikipedia.org/wiki/K-d_tree#Nearest_neighbour_search
Recommended answer
PostgreSQL supports the distance operator <-> and, as I understand it, this can be used for analyzing text (with the pg_trgm module) and the geometry data types.
I do not know how to use it with more than one dimension. Maybe you will have to define your own distance function, or somehow convert your data to one column of a text or geometry type. For example, if you have a table with 8 columns (an 8-dimensional cube):
c1 c2 c3 c4 c5 c6 c7 c8
1 0 1 0 1 0 1 2
You can transform it to:
c1 c2 c3 c4 c5 c6 c7 c8
a b a b a b a c
And then to a table with one column:
c1
abababac
Then you can use (after creating a gist index):
SELECT c1, c1 <-> 'ababab'
FROM test_trgm
ORDER BY c1 <-> 'ababab';
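The query above presupposes a table and a trigram index that the answer does not show. A minimal sketch of that setup might look as follows; the sample rows and the index name are illustrative assumptions, while the table and column names come from the query.

```sql
-- pg_trgm provides similarity() and the <-> distance operator for text
create extension if not exists pg_trgm;

-- One-column table holding the encoded rows (sample values are made up)
create table test_trgm (c1 text);
insert into test_trgm (c1) values ('abababac'), ('bbbbbbbb'), ('abababab');

-- The gist_trgm_ops operator class lets the index serve ORDER BY c1 <-> '...'
create index test_trgm_c1_idx on test_trgm using gist (c1 gist_trgm_ops);
```

Naming the operator class is needed because text has no default operator class for gist indexes.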
Example
Create sample data
-- Create some temporary data
-- Note: the tables are created in the tmp schema (change the SQL to use your own schema) and are dropped first if they already exist!
drop table if exists tmp.test_data;
-- Random integer matrix 100*8
create table tmp.test_data as (
select
trunc(random()*100)::int as input_variable_1,
trunc(random()*100)::int as input_variable_2,
trunc(random()*100)::int as input_variable_3,
trunc(random()*100)::int as input_variable_4,
trunc(random()*100)::int as input_variable_5,
trunc(random()*100)::int as input_variable_6,
trunc(random()*100)::int as input_variable_7,
trunc(random()*100)::int as input_variable_8
from
generate_series(1,100,1)
);
Transform input data into text
drop table if exists tmp.test_data_trans;
create table tmp.test_data_trans as (
select
input_variable_1 || ';' ||
input_variable_2 || ';' ||
input_variable_3 || ';' ||
input_variable_4 || ';' ||
input_variable_5 || ';' ||
input_variable_6 || ';' ||
input_variable_7 || ';' ||
input_variable_8 as trans_variable
from
tmp.test_data
);
This will give you one variable, trans_variable, in which all 8 dimensions are stored:
trans_variable
40;88;68;29;19;54;40;90
80;49;56;57;42;36;50;68
29;13;63;33;0;18;52;77
44;68;18;81;28;24;20;89
80;62;20;49;4;87;54;18
35;37;32;25;8;13;42;54
8;58;3;42;37;1;41;49
70;1;28;18;47;78;8;17
Instead of the || operator, you can also use the following syntax (shorter, but more cryptic):
-- t::text renders the whole row as '(40,88,...)'; translate turns each ',' into ';' and drops the parentheses
select
translate(t::text, ',()', ';') as trans_variable
from
tmp.test_data t;
Add index
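-- The pg_trgm extension supplies the gist_trgm_ops operator class and the <-> operator for text
create extension if not exists pg_trgm;
create index test_data_gist_index on tmp.test_data_trans using gist(trans_variable gist_trgm_ops);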
Test distance
Note: I selected one row from the table (52;42;18;50;68;29;8;55) and used a slightly changed value (42;42;18;52;98;29;8;55) to test the distance. Of course, you will have completely different values in your test data, because it is a random matrix.
select
*,
trans_variable <-> '42;42;18;52;98;29;8;55' as distance,
similarity(trans_variable, '42;42;18;52;98;29;8;55') as similarity
from
tmp.test_data_trans
order by
trans_variable <-> '42;42;18;52;98;29;8;55';
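To turn this ordering into a k-nearest-neighbor lookup, a LIMIT can be added so that only the closest rows come back. The sketch below assumes the tmp.test_data_trans table from the example; k = 5 is an arbitrary choice.

```sql
-- Return only the 5 rows closest to the lookup string (k = 5 is arbitrary)
select
    trans_variable,
    trans_variable <-> '42;42;18;52;98;29;8;55' as distance
from
    tmp.test_data_trans
order by
    trans_variable <-> '42;42;18;52;98;29;8;55'
limit 5;
```

Ordering directly by the <-> expression together with a LIMIT is the query shape a gist trigram index is designed to serve.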
You can use the distance operator <-> or the similarity function. Distance = 1 - Similarity.
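The other option mentioned earlier, defining your own distance function, could be sketched roughly as below. This is only an illustration under assumptions: the function name euclidean_distance, the points table, and the choice of Euclidean distance are hypothetical, and the query scans every row instead of using an index. The data are the two sample points and the lookup value from the question.

```sql
-- Hypothetical helper: Euclidean distance between two equal-length arrays
create or replace function euclidean_distance(float8[], float8[])
returns float8 as $$
    select sqrt(sum(($1[i] - $2[i]) ^ 2))
    from generate_subscripts($1, 1) as i;
$$ language sql immutable;

-- The two sample points from the question
create temporary table points (name text, coords float8[]);
insert into points values
    ('Point1', array[1,1,1,1,1,1,1,1]),
    ('Point2', array[2,2,2,2,2,2,2,2]);

-- Point1 sorts first: it differs from the lookup value in one coordinate only
select name,
       euclidean_distance(coords, array[1,1,1,1,1,1,1,2]::float8[]) as distance
from points
order by distance
limit 1;
```

Unlike the trigram approach, this compares the numeric values themselves, at the cost of a full scan for every lookup.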