Need SQL optimization (maybe DISTINCT ON is the reason?)

Question

Related, preceding question:
Select a random entry from a group after grouping by a value (not column)?

My current query looks like this:
WITH
points AS (
SELECT unnest(array_of_points) AS p
),
gtps AS (
SELECT DISTINCT ON(points.p)
points.p, m.groundtruth
FROM measurement m, points
WHERE st_distance(m.groundtruth, points.p) < distance
ORDER BY points.p, RANDOM()
)
SELECT DISTINCT ON(gtps.p, gtps.groundtruth, m.anchor_id)
m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()
Semantics:

There are two input values:

- Line 4: an array of points: array_of_points
- Line 12: a double precision number: distance
First paragraph (lines 1-6):

- Creates a table from the points array, used for...
Second paragraph (lines 8-14):

- For each point inside the points table: get a random(!) groundtruth point from the measurement table that has a distance < distance
- Save those tuples inside the gtps table
Third paragraph (lines 16-19):

- For each groundtruth value inside the gtps table: get all anchor_id values and...
- If an anchor_id value is not unique: then choose a random one
Output: id, anchor_id, groundtruth, p (the input value from array_of_points)

Example table:
id | anchor_id | groundtruth | data
-----------------------------------
1 | 1 | POINT(1 4) | ...
2 | 3 | POINT(1 4) | ...
3 | 8 | POINT(1 4) | ...
4 | 6 | POINT(1 4) | ...
-----------------------------------
5 | 2 | POINT(3 2) | ...
6 | 4 | POINT(3 2) | ...
-----------------------------------
7 | 1 | POINT(4 3) | ...
8 | 1 | POINT(4 3) | ...
9 | 6 | POINT(4 3) | ...
10 | 7 | POINT(4 3) | ...
11 | 3 | POINT(4 3) | ...
-----------------------------------
12 | 1 | POINT(6 2) | ...
13 | 5 | POINT(6 2) | ...
Example result:
id | anchor_id | groundtruth | p
-----------------------------------------
1 | 1 | POINT(1 4) | POINT(1 0)
2 | 3 | POINT(1 4) | POINT(1 0)
4 | 6 | POINT(1 4) | POINT(1 0)
3 | 8 | POINT(1 4) | POINT(1 0)
5 | 2 | POINT(3 2) | POINT(2 2)
6 | 4 | POINT(3 2) | POINT(2 2)
1 | 1 | POINT(1 4) | POINT(4 8)
2 | 3 | POINT(1 4) | POINT(4 8)
4 | 6 | POINT(1 4) | POINT(4 8)
3 | 8 | POINT(1 4) | POINT(4 8)
12 | 1 | POINT(6 2) | POINT(7 3)
13 | 5 | POINT(6 2) | POINT(7 3)
1 | 1 | POINT(4 3) | POINT(9 1)
11 | 3 | POINT(4 3) | POINT(9 1)
9 | 6 | POINT(4 3) | POINT(9 1)
10 | 7 | POINT(4 3) | POINT(9 1)
As you can see:

- Each input value can have multiple equal groundtruth values.
- If an input value has multiple groundtruth values, those must all be equal.
- Each groundtruth-inputPoint tuple is connected with every possible anchor_id for that groundtruth.
- Two different input values can have the same corresponding groundtruth value.
- Two distinct groundtruth-inputPoint tuples can have the same anchor_id.
- Two identical groundtruth-inputPoint tuples must have different anchor_ids.
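The rules above can be sanity-checked mechanically. The sketch below (a transcription of the example result table into Python tuples, for illustration only) verifies the two non-obvious constraints: every input point maps to exactly one groundtruth value, and identical (groundtruth, p) pairs never repeat an anchor_id:

```python
from collections import defaultdict

# Example result rows transcribed from the table above: (id, anchor_id, groundtruth, p)
rows = [
    (1, 1, (1, 4), (1, 0)), (2, 3, (1, 4), (1, 0)),
    (4, 6, (1, 4), (1, 0)), (3, 8, (1, 4), (1, 0)),
    (5, 2, (3, 2), (2, 2)), (6, 4, (3, 2), (2, 2)),
    (1, 1, (1, 4), (4, 8)), (2, 3, (1, 4), (4, 8)),
    (4, 6, (1, 4), (4, 8)), (3, 8, (1, 4), (4, 8)),
    (12, 1, (6, 2), (7, 3)), (13, 5, (6, 2), (7, 3)),
    (1, 1, (4, 3), (9, 1)), (11, 3, (4, 3), (9, 1)),
    (9, 6, (4, 3), (9, 1)), (10, 7, (4, 3), (9, 1)),
]

# Rule: all groundtruth values for one input point p must be equal.
by_p = defaultdict(set)
for _id, anchor, gt, p in rows:
    by_p[p].add(gt)
assert all(len(gts) == 1 for gts in by_p.values())

# Rule: identical (groundtruth, p) tuples must have distinct anchor_ids.
by_pair = defaultdict(list)
for _id, anchor, gt, p in rows:
    by_pair[(gt, p)].append(anchor)
assert all(len(a) == len(set(a)) for a in by_pair.values())
```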
Benchmarks (for two input values):

- Lines 1-6: 16ms
- Lines 8-14: 48ms
- Lines 16-19: 600ms

EXPLAIN VERBOSE:
Unique (cost=11119.32..11348.33 rows=18 width=72)
Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
CTE points
-> Result (cost=0.00..0.01 rows=1 width=0)
Output: unnest('{0101000000EE7C3F355EF24F4019390B7BDA011940:01010000003480B74082FA44402CD49AE61D173C40}'::geometry[])
CTE gtps
-> Unique (cost=7659.95..7698.12 rows=1 width=160)
Output: points.p, m.groundtruth, (random())
-> Sort (cost=7659.95..7679.04 rows=7634 width=160)
Output: points.p, m.groundtruth, (random())
Sort Key: points.p, (random())
-> Nested Loop (cost=0.00..6565.63 rows=7634 width=160)
Output: points.p, m.groundtruth, random()
Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
-> CTE Scan on points (cost=0.00..0.02 rows=1 width=32)
Output: points.p
-> Seq Scan on public.measurement m (cost=0.00..535.01 rows=22901 width=132)
Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"
-> Sort (cost=3421.18..3478.43 rows=22901 width=72)
Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())
-> Nested Loop (cost=0.00..821.29 rows=22901 width=72)
Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, random()
-> CTE Scan on gtps (cost=0.00..0.02 rows=1 width=64)
Output: gtps.p, gtps.groundtruth
-> Seq Scan on public.measurement m (cost=0.00..535.01 rows=22901 width=8)
Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"
EXPLAIN ANALYZE:
Unique (cost=11119.32..11348.33 rows=18 width=72) (actual time=548.991..657.992 rows=36 loops=1)
CTE points
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.004..0.011 rows=2 loops=1)
CTE gtps
-> Unique (cost=7659.95..7698.12 rows=1 width=160) (actual time=133.416..146.745 rows=2 loops=1)
-> Sort (cost=7659.95..7679.04 rows=7634 width=160) (actual time=133.415..142.255 rows=15683 loops=1)
Sort Key: points.p, (random())
Sort Method: external merge Disk: 1248kB
-> Nested Loop (cost=0.00..6565.63 rows=7634 width=160) (actual time=0.045..46.670 rows=15683 loops=1)
Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
-> CTE Scan on points (cost=0.00..0.02 rows=1 width=32) (actual time=0.007..0.020 rows=2 loops=1)
-> Seq Scan on measurement m (cost=0.00..535.01 rows=22901 width=132) (actual time=0.013..3.902 rows=22901 loops=2)
-> Sort (cost=3421.18..3478.43 rows=22901 width=72) (actual time=548.989..631.323 rows=45802 loops=1)
Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())
Sort Method: external merge Disk: 4008kB
-> Nested Loop (cost=0.00..821.29 rows=22901 width=72) (actual time=133.449..166.294 rows=45802 loops=1)
-> CTE Scan on gtps (cost=0.00..0.02 rows=1 width=64) (actual time=133.420..146.753 rows=2 loops=1)
-> Seq Scan on measurement m (cost=0.00..535.01 rows=22901 width=8) (actual time=0.014..4.409 rows=22901 loops=2)
Total runtime: 834.626 ms
When running live this should run with about 100-1000 input values, so for now it would take 35 to 350 seconds, which is far too much.

I already tried removing the RANDOM() functions. That decreases the runtime (for 2 input values) from about 670ms to about 530ms, so it isn't the main impact at the moment.

It's also possible to run 2 or 3 separate queries and do some parts in software (it's running on a Ruby on Rails server) if that's easier/faster. For example the random selection?!
SELECT
m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
measurement m
JOIN
(SELECT unnest(point_array) AS p) AS ps
ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY groundtruth, ps.p
With this query it is extremely fast (15ms), but a lot is still missing:

- I just need a random row for each ps.p
- The two arrays belong to each other. Meaning: the order of the items inside is important!
- Those two arrays need to get filtered (randomly): for each anchor_id in the array that appears more than once, keep a random one and delete all others. This also means removing the corresponding id from the id array for every deleted anchor_id
It would also be nice if anchor_id and id could be stored inside an array of tuples. For example: {[4,1],[6,3],[4,2],[8,5],[4,4]} (constraints: every tuple is unique, every id (== 2nd value in the example) is unique, anchor_ids are not unique). This example shows the result without the filters that still must be applied. With the filters applied, it would look like this: {[6,3],[4,4],[8,5]}.
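Over such a tuple array, the filter described above (one random survivor per anchor_id) is easy to express in application code. A hypothetical sketch, using the example array from the question:

```python
import random

def dedupe_by_anchor(pairs):
    """Keep one random [anchor_id, id] pair per anchor_id.
    Shuffling first makes the survivor of each anchor_id random."""
    pairs = pairs[:]            # don't mutate the caller's list
    random.shuffle(pairs)
    seen, kept = set(), []
    for anchor_id, id_ in pairs:
        if anchor_id not in seen:
            seen.add(anchor_id)
            kept.append([anchor_id, id_])
    return kept

pairs = [[4, 1], [6, 3], [4, 2], [8, 5], [4, 4]]  # example from the question
kept = dedupe_by_anchor(pairs)
assert sorted(a for a, _ in kept) == [4, 6, 8]  # one pair per anchor_id
assert all(p in pairs for p in kept)            # only original pairs survive
```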
SELECT DISTINCT ON (ps.p)
m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
measurement m
JOIN
(SELECT unnest(point_array) AS p) AS ps
ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY ps.p, m.groundtruth
ORDER BY ps.p, RANDOM()
This now gives pretty nice results and is still very fast: 16ms.
There's just one thing left to do:
- ARRAY_AGG(m.anchor_id) is already randomized, but:
- it contains lots of duplicate entries, so:
- I'd like to use something like DISTINCT on it, but:
- it has to be synchronized with ARRAY_AGG(m.id). This means: if the DISTINCT keeps indices 1, 4 and 7 of the anchor_id array, then it also has to keep indices 1, 4 and 7 of the id array (and of course delete all others)
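The index-synchronized DISTINCT described above can also be done in application code. A minimal sketch (a hypothetical helper, not part of the query) that deduplicates anchor_ids while keeping the two parallel arrays aligned:

```python
import random

def synced_distinct(anchor_ids, ids):
    """Remove duplicate anchor_ids, keeping the parallel arrays in
    sync: whichever indices survive in anchor_ids also survive in
    ids. A random occurrence wins per anchor_id."""
    indices = list(range(len(anchor_ids)))
    random.shuffle(indices)                 # random winner per anchor_id
    winners = {}
    for i in indices:
        winners.setdefault(anchor_ids[i], i)
    keep = sorted(winners.values())         # preserve original order
    return [anchor_ids[i] for i in keep], [ids[i] for i in keep]

a, i = synced_distinct([1, 3, 1, 5, 3], [10, 11, 12, 13, 14])
assert sorted(a) == [1, 3, 5]               # duplicates removed
assert len(a) == len(i)                     # arrays stayed in sync
```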
Answer

"It would also be nice if anchor_id and id could be stored inside an array of tuples."
Aggregate function for multi-dimensional arrays

I suppose you create a two-dimensional array for that. That's easier to handle than an ARRAY of record. Standard array_agg() cannot aggregate multi-dimensional arrays, but you can write your own aggregate function rather easily for that:
CREATE AGGREGATE array_agg_mult (anyarray) (
SFUNC = array_cat
,STYPE = anyarray
,INITCOND = '{}'
);
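For intuition only: this custom aggregate folds array_cat over its inputs, starting from the empty array (INITCOND = '{}'). A rough Python analogue of that fold (not the Postgres implementation):

```python
from functools import reduce

def array_cat(state, elem):
    # Python list concatenation mirrors array_cat joining
    # 2-D arrays along the first dimension.
    return state + elem

# Each input row arrives wrapped as ARRAY[ARRAY[anchor_id, id]].
rows = [[[4, 1]], [[6, 3]], [[8, 5]]]
result = reduce(array_cat, rows, [])    # [] plays the role of INITCOND '{}'
assert result == [[4, 1], [6, 3], [8, 5]]
```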
Read the explanation in this related answer:
Selecting data into a Postgres array

"For each anchor_id in the array that appears more than once: keep a random one and delete all others. This also means removing the corresponding id from the id array for every deleted anchor_id."
Query
SELECT DISTINCT ON (p)
p, groundtruth, array_agg_mult(ARRAY[ARRAY[anchor_id, id]]) AS ids
FROM (
SELECT DISTINCT ON (ps.p, m.groundtruth, m.anchor_id)
ps.p, m.groundtruth, m.anchor_id, m.id
FROM (SELECT unnest(point_array) AS p) AS ps
JOIN measurement m ON ST_DWithin(ps.p, m.groundtruth, distance)
ORDER BY ps.p, m.groundtruth, m.anchor_id, random()
) x
GROUP BY p, groundtruth
ORDER BY p, random();
- Subquery x gets distinct anchor_id per (p, groundtruth) and picks a random row if there are multiple peers. This way the connection anchor_id - id stays intact.
- The outer query aggregates a 2-dimensional array like you wished for, ordered by anchor_id. If you want anchor_id ordered randomly, use random() once more:
  array_agg_mult(ARRAY[ARRAY[anchor_id, id]] ORDER BY random())
- And finally, the DISTINCT ON picks only 1 groundtruth per p, randomly again.