分组后按值(非列)从组中选择随机条目? [英] Select a random entry from a group after grouping by a value (not column)?
问题描述
我想用Postgres和PostGIS编写查询。我也使用Rails和 rgeo
, rgeo-activerecord
和 activerecord-postgis-adapter
,但Rails的东西并不重要。
表结构: 示例数据: 这张表是某种手动创建的 参数: 查询应该做什么: 1. 查询查找所有 SQL很容易: 2。现在我们有 3。接下来,我想通过随机选择其中一个(!)来消除相同的元组。为什么不简单地采取第一个?因为 选择SQL中的随机行: 我遇到的问题是:我可以想象使用SQL
<$ p
- int id
- int anchor_id
- Point groundtruth
- data(对查询不重要)
id | anchor_id | groundtruth |数据
-----------------------------------
1 | 1 | POINT(1 4)| ...
2 | 3 | POINT(1 4)| ...
3 | 2 | POINT(1 4)| ...
4 | 3 | POINT(1 4)| ...
-----------------------------------
5 | 2 | POINT(3 2)| ...
6 | 4 | POINT(3 2)| ...
-----------------------------------
7 | 1 | POINT(4 3)| ...
8 | 1 | POINT(4 3)| ...
9 | 1 | POINT(4 3)| ...
10 | 5 | POINT(4 3)| ...
11 | 3 | POINT(4 3)| ...
view
用于更快的查找(具有数百万行)。否则我们不得不加入8张桌子,否则它会变得更慢。但这不是问题的一部分。
简单版本:
p
d
groundtruth
具有距离的点<
p
WHERE st_distance(groundtruth,p)< d
groundtruth
点与他们的 anchor_id
s。正如你在上面的表格中看到的,可能有多个相同的groundtruth-anchor_id元组。例如: anchor_id = 3
和 groundtruth = POINT(1 4)
。
data
列不同。
SELECT。 .. ORDER BY RANDOM()LIMIT 1
LOOP
s和很多的子查询,但肯定有一个解决方案,使用 GROUP BY
或其他一些方法将使其更快。
完整版:
基本上与上面相同,区别在于:输入参数更改:
- 点数
p1
...p312456345
- 还有一个
d
如果简单查询正在工作,可以在SQL中使用 LOOP
来完成。但也许有一个更好的(和更快的)解决方案,因为数据库是真的很大!
解决方案
WITH ps AS(SELECT unnest(p_array)AS p)
SELECT DISTINCT ON(anchor_id,groundtruth)
*
FROM measurement m,ps
WHERE EXISTS(
SELECT 1
FROM ps
WHERE st_distance(m.groundtruth,ps.p)< d
)
ORDER BY anchor_id,groundtruth,random();
感谢Erwin Brandstetter!
为了消除重复,这可能是PostgreSQL中效率最高的查询:
SELECT DISTINCT ON (anchor_id,groundtruth)*
FROM measurement
WHERE st_distance(p,groundtruth)< d
关于这种查询风格的更多信息:
正如评论中提到的那样,这给你一个任意选择。如果您需要随机,则稍微贵一些:
SELECT DISTINCT ON(anchor_id,groundtruth)*
FROM measurement
WHERE st_distance(p,groundtruth)< d
ORDER BY anchor_id,groundtruth,random()
第二部分难以优化。 ** EXISTS * *
半连接可能是最快的选择。对于给定的表 ps(p点)
:
SELECT DISTINCT ON (anchor_id,groundtruth)*
FROM measurement m
WHERE EXISTS(
SELECT 1
FROM ps
WHERE st_distance(ps.p,m.groundtruth)< d
)
ORDER BY anchor_id,groundtruth,random();
只要一个 p
足够接近,它使查询的其余部分变得简单。
请务必用一个匹配的 GiST索引
。 b WITH ps AS SELECT unnest(p_array)AS p)
SELECT ...
根据评论更新
如果您只需要单行作为答案,则可以简化:
WITH ps AS(SELECT unnest(p_arra y)AS p)
SELECT *
FROM measurement m
WHERE EXISTS(
SELECT 1
FROM ps
WHERE st_distance(ps.p,m。 groundtruth) d
)
极限1;
加快 ST_DWithin()
用函数 ST_DWithin()
(和一个匹配的GiST索引!)。
获取一个在这里选择而不是CTE):
SELECT *
FROM measurement m
JOIN(SELECT unnest( p_array)AS p)ps ON ST_DWithin(ps.p,m.groundtruth,d)
LIMIT 1;
要获得所有点的一行 p $ c
FROM measurement m
JOIN(SELECT unnest(p_array)AS p)ps ON ST_DWithin(ps.p,m.groundtruth,d)
添加 ORDER BY random()
会使这个查询更加昂贵的。如果没有 random()
,Postgres可以从GiST索引中选择第一个匹配的行。
顺便说一下,<$ c 所有可能的匹配都必须随机检索和排序。 $ c> LIMIT 1 在 EXISTS
中是毫无意义的。在我提供的链接上阅读手册。或这个相关的问题。
I want to write a query using Postgres and PostGIS. I'm also using Rails with rgeo
, rgeo-activerecord
and activerecord-postgis-adapter
, but the Rails stuff is rather unimportant.
The table structure:
measurement
- int id
- int anchor_id
- Point groundtruth
- data (not important for the query)
Example data:
id | anchor_id | groundtruth | data
-----------------------------------
1 | 1 | POINT(1 4) | ...
2 | 3 | POINT(1 4) | ...
3 | 2 | POINT(1 4) | ...
4 | 3 | POINT(1 4) | ...
-----------------------------------
5 | 2 | POINT(3 2) | ...
6 | 4 | POINT(3 2) | ...
-----------------------------------
7 | 1 | POINT(4 3) | ...
8 | 1 | POINT(4 3) | ...
9 | 1 | POINT(4 3) | ...
10 | 5 | POINT(4 3) | ...
11 | 3 | POINT(4 3) | ...
This table is some kind of manually created view
for faster lookups (with millions of rows). Else we'd have to join 8 tables and it would get even slower. But that's not part of the problem.
Simple version:
Parameters:
- Point
p
- int
d
What the query should do:
1. The query looks for all groundtruth
Points which have a distance < d
from Point p
SQL for that is pretty easy: WHERE st_distance(groundtruth, p) < d
2. Now we have a list of groundtruth
points with their anchor_id
s. As you can see in the table above, it is possible to have multiple identical groundtruth-anchor_id tuples. For example: anchor_id=3
and groundtruth=POINT(1 4)
.
3. Next I'd like to eliminate the identical tuples, by choosing one of them randomly(!). Why not simply take the first? Because the data
column is different.
Choosing a random row in SQL: SELECT ... ORDER BY RANDOM() LIMIT 1
My problem with all of this is: I can imagine a solution using SQL LOOP
s and lot's of subqueries, but there's for sure a solution using GROUP BY
or some other methods which will make it faster.
Full version:
Basically the same as above with one difference: The input parameters change:
- lot's of Points
p1
...p312456345
- still one
d
If the simple query is working, this could be done using a LOOP
in SQL. But maybe there is a better (and faster) solution, because the database is really huge!
Solution
WITH ps AS (SELECT unnest(p_array) AS p)
SELECT DISTINCT ON (anchor_id, groundtruth)
*
FROM measurement m, ps
WHERE EXISTS (
SELECT 1
FROM ps
WHERE st_distance(m.groundtruth, ps.p) < d
)
ORDER BY anchor_id, groundtruth, random();
Thanks to Erwin Brandstetter!
To eliminate duplicates, this might be the most efficient query in PostgreSQL:
SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM measurement
WHERE st_distance(p, groundtruth) < d
More about this query style:
As mentioned in the comments this gives you an arbitrary pick. If you need random, somewhat more expensive:
SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM measurement
WHERE st_distance(p, groundtruth) < d
ORDER BY anchor_id, groundtruth, random()
The second part is harder to optimize. **EXISTS**
semi-join will probably be the fastest choice. For a given table ps (p point)
:
SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM measurement m
WHERE EXISTS (
SELECT 1
FROM ps
WHERE st_distance(ps.p, m.groundtruth) < d
)
ORDER BY anchor_id, groundtruth, random();
This can stop evaluating as soon as one p
is close enough and it keeps the rest of the query simple.
Be sure to back that up with a matching GiST index.
If you have an array as input, create a CTE with unnest()
on the fly:
WITH ps AS (SELECT unnest(p_array) AS p)
SELECT ...
Update according to comment
If you only need a single row as answer, you can simplify:
WITH ps AS (SELECT unnest(p_array) AS p)
SELECT *
FROM measurement m
WHERE EXISTS (
SELECT 1
FROM ps
WHERE st_distance(ps.p, m.groundtruth) < d
)
LIMIT 1;
Faster with ST_DWithin()
Probably more efficient with the function ST_DWithin()
(and a matching GiST index!).
To get one row (using a sub-select instead of a CTE here):
SELECT *
FROM measurement m
JOIN (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)
LIMIT 1;
To get one row for every point p
within distance d
:
SELECT DISTINCT ON (ps.p) *
FROM measurement m
JOIN (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)
Adding ORDER BY random()
will make this query more expensive. Without random()
, Postgres can just pick the first matching row from the GiST index. Else all possible matches have to be retrieved and ordered randomly.
BTW, LIMIT 1
inside EXISTS
is pointless. Read the manual at the link I provided or this related question.
这篇关于分组后按值(非列)从组中选择随机条目?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!