Need SQL optimization (maybe DISTINCT ON is the reason?)

Question

Related, preceding question:
Select a random entry from a group after grouping by a value (not column)?

My current query looks like this:
WITH
points AS (
SELECT unnest(array_of_points) AS p
),
gtps AS (
SELECT DISTINCT ON(points.p)
points.p, m.groundtruth
FROM measurement m, points
WHERE st_distance(m.groundtruth, points.p) < distance
ORDER BY points.p, RANDOM()
)
SELECT DISTINCT ON(gtps.p, gtps.groundtruth, m.anchor_id)
m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()
Semantics:

There are two input values:

- Line 4: an array of points: array_of_points
- Line 12: a double precision number: distance
First paragraph (lines 1-6):

- Creates a table from the points array, used for...
Second paragraph (lines 8-14):

- For each point inside the points table: get a random(!) groundtruth point from the measurement table that has a distance < distance
- Save those tuples inside the gtps table
Third paragraph (lines 16-19):

- For each groundtruth value inside the gtps table: get all anchor_id values and...
- If an anchor_id value is not unique: then choose a random one
Output: id, anchor_id, groundtruth, p (the input value from array_of_points)

Example table:
id | anchor_id | groundtruth | data
-----------------------------------
1 | 1 | POINT(1 4) | ...
2 | 3 | POINT(1 4) | ...
3 | 8 | POINT(1 4) | ...
4 | 6 | POINT(1 4) | ...
-----------------------------------
5 | 2 | POINT(3 2) | ...
6 | 4 | POINT(3 2) | ...
-----------------------------------
7 | 1 | POINT(4 3) | ...
8 | 1 | POINT(4 3) | ...
9 | 6 | POINT(4 3) | ...
10 | 7 | POINT(4 3) | ...
11 | 3 | POINT(4 3) | ...
-----------------------------------
12 | 1 | POINT(6 2) | ...
13 | 5 | POINT(6 2) | ...
Example result:
id | anchor_id | groundtruth | p
-----------------------------------------
1 | 1 | POINT(1 4) | POINT(1 0)
2 | 3 | POINT(1 4) | POINT(1 0)
4 | 6 | POINT(1 4) | POINT(1 0)
3 | 8 | POINT(1 4) | POINT(1 0)
5 | 2 | POINT(3 2) | POINT(2 2)
6 | 4 | POINT(3 2) | POINT(2 2)
1 | 1 | POINT(1 4) | POINT(4 8)
2 | 3 | POINT(1 4) | POINT(4 8)
4 | 6 | POINT(1 4) | POINT(4 8)
3 | 8 | POINT(1 4) | POINT(4 8)
12 | 1 | POINT(6 2) | POINT(7 3)
13 | 5 | POINT(6 2) | POINT(7 3)
1 | 1 | POINT(4 3) | POINT(9 1)
11 | 3 | POINT(4 3) | POINT(9 1)
9 | 6 | POINT(4 3) | POINT(9 1)
10 | 7 | POINT(4 3) | POINT(9 1)
As you can see:

- Each input value can have multiple equal groundtruth values.
- If an input value has multiple groundtruth values, those must all be equal.
- Each groundtruth-inputPoint tuple is connected with every possible anchor_id for that groundtruth.
- Two different input values can have the same corresponding groundtruth value.
- Two distinct groundtruth-inputPoint tuples can have the same anchor_id.
- Two identical groundtruth-inputPoint tuples must have different anchor_ids.
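The rules above can be sanity-checked mechanically. The sketch below (a transcription of the example result table into Python tuples, for illustration only) verifies the two non-obvious constraints: every input point maps to exactly one groundtruth value, and identical (groundtruth, p) pairs never repeat an anchor_id:

```python
from collections import defaultdict

# Example result rows transcribed from the table above: (id, anchor_id, groundtruth, p)
rows = [
    (1, 1, (1, 4), (1, 0)), (2, 3, (1, 4), (1, 0)),
    (4, 6, (1, 4), (1, 0)), (3, 8, (1, 4), (1, 0)),
    (5, 2, (3, 2), (2, 2)), (6, 4, (3, 2), (2, 2)),
    (1, 1, (1, 4), (4, 8)), (2, 3, (1, 4), (4, 8)),
    (4, 6, (1, 4), (4, 8)), (3, 8, (1, 4), (4, 8)),
    (12, 1, (6, 2), (7, 3)), (13, 5, (6, 2), (7, 3)),
    (1, 1, (4, 3), (9, 1)), (11, 3, (4, 3), (9, 1)),
    (9, 6, (4, 3), (9, 1)), (10, 7, (4, 3), (9, 1)),
]

# Rule: all groundtruth values for one input point p must be equal.
by_p = defaultdict(set)
for _id, anchor, gt, p in rows:
    by_p[p].add(gt)
assert all(len(gts) == 1 for gts in by_p.values())

# Rule: identical (groundtruth, p) tuples must have distinct anchor_ids.
by_pair = defaultdict(list)
for _id, anchor, gt, p in rows:
    by_pair[(gt, p)].append(anchor)
assert all(len(a) == len(set(a)) for a in by_pair.values())
```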
Benchmarks (for two input values):

- Lines 1-6: 16ms
- Lines 8-14: 48ms
- Lines 16-19: 600ms

EXPLAIN VERBOSE:
Unique (cost=11119.32..11348.33 rows=18 width=72)
Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
CTE points
-> Result (cost=0.00..0.01 rows=1 width=0)
Output: unnest('{0101000000EE7C3F355EF24F4019390B7BDA011940:01010000003480B74082FA44402CD49AE61D173C40}'::geometry[])
CTE gtps
-> Unique (cost=7659.95..7698.12 rows=1 width=160)
Output: points.p, m.groundtruth, (random())
-> Sort (cost=7659.95..7679.04 rows=7634 width=160)
Output: points.p, m.groundtruth, (random())
Sort Key: points.p, (random())
-> Nested Loop (cost=0.00..6565.63 rows=7634 width=160)
Output: points.p, m.groundtruth, random()
Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
-> CTE Scan on points (cost=0.00..0.02 rows=1 width=32)
Output: points.p
-> Seq Scan on public.measurement m (cost=0.00..535.01 rows=22901 width=132)
Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"
-> Sort (cost=3421.18..3478.43 rows=22901 width=72)
Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())
-> Nested Loop (cost=0.00..821.29 rows=22901 width=72)
Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, random()
-> CTE Scan on gtps (cost=0.00..0.02 rows=1 width=64)
Output: gtps.p, gtps.groundtruth
-> Seq Scan on public.measurement m (cost=0.00..535.01 rows=22901 width=8)
Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"
EXPLAIN ANALYZE:
Unique (cost=11119.32..11348.33 rows=18 width=72) (actual time=548.991..657.992 rows=36 loops=1)
CTE points
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.004..0.011 rows=2 loops=1)
CTE gtps
-> Unique (cost=7659.95..7698.12 rows=1 width=160) (actual time=133.416..146.745 rows=2 loops=1)
-> Sort (cost=7659.95..7679.04 rows=7634 width=160) (actual time=133.415..142.255 rows=15683 loops=1)
Sort Key: points.p, (random())
Sort Method: external merge Disk: 1248kB
-> Nested Loop (cost=0.00..6565.63 rows=7634 width=160) (actual time=0.045..46.670 rows=15683 loops=1)
Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
-> CTE Scan on points (cost=0.00..0.02 rows=1 width=32) (actual time=0.007..0.020 rows=2 loops=1)
-> Seq Scan on measurement m (cost=0.00..535.01 rows=22901 width=132) (actual time=0.013..3.902 rows=22901 loops=2)
-> Sort (cost=3421.18..3478.43 rows=22901 width=72) (actual time=548.989..631.323 rows=45802 loops=1)
Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())
Sort Method: external merge Disk: 4008kB
-> Nested Loop (cost=0.00..821.29 rows=22901 width=72) (actual time=133.449..166.294 rows=45802 loops=1)
-> CTE Scan on gtps (cost=0.00..0.02 rows=1 width=64) (actual time=133.420..146.753 rows=2 loops=1)
-> Seq Scan on measurement m (cost=0.00..535.01 rows=22901 width=8) (actual time=0.014..4.409 rows=22901 loops=2)
Total runtime: 834.626 ms
When running live this should run with about 100-1000 input values, so for now it would take 35 to 350 seconds, which is far too much.

I already tried removing the RANDOM() functions. That decreases the runtime (for 2 input values) from about 670ms to about 530ms, so it isn't the main impact at the moment.

It's also possible to run 2 or 3 separate queries and do some parts in software (it's running on a Ruby on Rails server) if that's easier/faster. For example the random selection?!
SELECT
m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
measurement m
JOIN
(SELECT unnest(point_array) AS p) AS ps
ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY groundtruth, ps.p
With this query it is extremely fast (15ms), but a lot is still missing:

- I just need a random row for each ps.p
- The two arrays belong to each other. Meaning: the order of the items inside is important!
- Those two arrays need to get filtered (randomly): for each anchor_id in the array that appears more than once, keep a random one and delete all others. This also means removing the corresponding id from the id array for every deleted anchor_id
It would also be nice if anchor_id and id could be stored inside an array of tuples. For example: {[4,1],[6,3],[4,2],[8,5],[4,4]} (constraints: every tuple is unique, every id (== 2nd value in the example) is unique, anchor_ids are not unique). This example shows the result without the filters that still must be applied. With the filters applied, it would look like this: {[6,3],[4,4],[8,5]}.
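Over such a tuple array, the filter described above (one random survivor per anchor_id) is easy to express in application code. A hypothetical sketch, using the example array from the question:

```python
import random

def dedupe_by_anchor(pairs):
    """Keep one random [anchor_id, id] pair per anchor_id.
    Shuffling first makes the survivor of each anchor_id random."""
    pairs = pairs[:]            # don't mutate the caller's list
    random.shuffle(pairs)
    seen, kept = set(), []
    for anchor_id, id_ in pairs:
        if anchor_id not in seen:
            seen.add(anchor_id)
            kept.append([anchor_id, id_])
    return kept

pairs = [[4, 1], [6, 3], [4, 2], [8, 5], [4, 4]]  # example from the question
kept = dedupe_by_anchor(pairs)
assert sorted(a for a, _ in kept) == [4, 6, 8]  # one pair per anchor_id
assert all(p in pairs for p in kept)            # only original pairs survive
```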
SELECT DISTINCT ON (ps.p)
m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
measurement m
JOIN
(SELECT unnest(point_array) AS p) AS ps
ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY ps.p, m.groundtruth
ORDER BY ps.p, RANDOM()
This now gives pretty nice results and is still very fast: 16ms.
There's just one thing left to do:
- ARRAY_AGG(m.anchor_id) is already randomized, but:
- it contains lots of duplicate entries, so:
- I'd like to use something like DISTINCT on it, but:
- it has to be synchronized with ARRAY_AGG(m.id). This means: if the DISTINCT keeps indices 1, 4 and 7 of the anchor_id array, then it also has to keep indices 1, 4 and 7 of the id array (and of course delete all others)
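The index-synchronized DISTINCT described above can also be done in application code. A minimal sketch (a hypothetical helper, not part of the query) that deduplicates anchor_ids while keeping the two parallel arrays aligned:

```python
import random

def synced_distinct(anchor_ids, ids):
    """Remove duplicate anchor_ids, keeping the parallel arrays in
    sync: whichever indices survive in anchor_ids also survive in
    ids. A random occurrence wins per anchor_id."""
    indices = list(range(len(anchor_ids)))
    random.shuffle(indices)                 # random winner per anchor_id
    winners = {}
    for i in indices:
        winners.setdefault(anchor_ids[i], i)
    keep = sorted(winners.values())         # preserve original order
    return [anchor_ids[i] for i in keep], [ids[i] for i in keep]

a, i = synced_distinct([1, 3, 1, 5, 3], [10, 11, 12, 13, 14])
assert sorted(a) == [1, 3, 5]               # duplicates removed
assert len(a) == len(i)                     # arrays stayed in sync
```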
Answer

"It would also be nice if anchor_id and id could be stored inside an array of tuples."
Aggregate function for multi-dimensional arrays

I suppose you create a two-dimensional array for that. That's easier to handle than an ARRAY of record. Standard array_agg() cannot aggregate multi-dimensional arrays, but you can write your own aggregate function rather easily for that:
CREATE AGGREGATE array_agg_mult (anyarray) (
SFUNC = array_cat
,STYPE = anyarray
,INITCOND = '{}'
);
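For intuition only: this custom aggregate folds array_cat over its inputs, starting from the empty array (INITCOND = '{}'). A rough Python analogue of that fold (not the Postgres implementation):

```python
from functools import reduce

def array_cat(state, elem):
    # Python list concatenation mirrors array_cat joining
    # 2-D arrays along the first dimension.
    return state + elem

# Each input row arrives wrapped as ARRAY[ARRAY[anchor_id, id]].
rows = [[[4, 1]], [[6, 3]], [[8, 5]]]
result = reduce(array_cat, rows, [])    # [] plays the role of INITCOND '{}'
assert result == [[4, 1], [6, 3], [8, 5]]
```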
Read the explanation in this related answer:
Selecting data into a Postgres array

"For each anchor_id in the array that appears more than once: keep a random one and delete all others. This also means removing the corresponding id from the id array for every deleted anchor_id."
Query
SELECT DISTINCT ON (p)
p, groundtruth, array_agg_mult(ARRAY[ARRAY[anchor_id, id]]) AS ids
FROM (
SELECT DISTINCT ON (ps.p, m.groundtruth, m.anchor_id)
ps.p, m.groundtruth, m.anchor_id, m.id
FROM (SELECT unnest(point_array) AS p) AS ps
JOIN measurement m ON ST_DWithin(ps.p, m.groundtruth, distance)
ORDER BY ps.p, m.groundtruth, m.anchor_id, random()
) x
GROUP BY p, groundtruth
ORDER BY p, random();
- Subquery x gets distinct anchor_id per (p, groundtruth) and picks a random row if there are multiple peers. This way the connection anchor_id - id stays intact.
- The outer query aggregates a 2-dimensional array like you wished for, ordered by anchor_id. If you want anchor_id ordered randomly, use random() once more:
  array_agg_mult(ARRAY[ARRAY[anchor_id, id]] ORDER BY random())
- And finally, the DISTINCT ON picks only 1 groundtruth per p, randomly again.