分组后按值(非列)从组中选择随机条目? [英] Select a random entry from a group after grouping by a value (not column)?

查看:124
本文介绍了分组后按值(非列)从组中选择随机条目?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想用Postgres和PostGIS编写查询。我也使用Rails和 rgeo rgeo-activerecord activerecord-postgis-adapter ,但Rails的东西并不重要。



表结构:

<$ p
- int id
- int anchor_id
- Point groundtruth
- data(对查询不重要)

示例数据:

  id | anchor_id | groundtruth |数据
-----------------------------------
1 | 1 | POINT(1 4)| ...
2 | 3 | POINT(1 4)| ...
3 | 2 | POINT(1 4)| ...
4 | 3 | POINT(1 4)| ...
-----------------------------------
5 | 2 | POINT(3 2)| ...
6 | 4 | POINT(3 2)| ...
-----------------------------------
7 | 1 | POINT(4 3)| ...
8 | 1 | POINT(4 3)| ...
9 | 1 | POINT(4 3)| ...
10 | 5 | POINT(4 3)| ...
11 | 3 | POINT(4 3)| ...

这张表是某种手动创建的 view 用于更快的查找(具有数百万行)。否则我们不得不加入8张桌子,否则它会变得更慢。但这不是问题的一部分。






简单版本:



参数:




  • 积分 p

  • int d



查询应该做什么:



1. 查询查找所有 groundtruth 具有距离的点< p



SQL很容易: WHERE st_distance(groundtruth,p)< d



2。现在我们有 groundtruth 点与他们的 anchor_id s。正如你在上面的表格中看到的,可能有多个相同的groundtruth-anchor_id元组。例如: anchor_id = 3 groundtruth = POINT(1 4)



3。接下来,我想通过随机选择其中一个(!)来消除相同的元组。为什么不简单地采取第一个?因为 data 列不同。



选择SQL中的随机行: SELECT。 .. ORDER BY RANDOM()LIMIT 1



我遇到的问题是:我可以想象使用SQL LOOP s和很多的子查询,肯定有一个解决方案,使用 GROUP BY 或其他一些方法将使其更快。

完整版:



基本上与上面相同,区别在于:输入参数更改:


  • 点数 p1 ... p312456345

  • 还有一个 d



如果简单查询正在工作,可以在SQL中使用 LOOP 来完成。但也许有一个更好的(和更快的)解决方案,因为数据库是真的很大!






解决方案



  WITH ps AS(SELECT unnest(p_array)AS p)
SELECT DISTINCT ON(anchor_id,groundtruth)
*
FROM measurement m,ps
WHERE EXISTS(
SELECT 1
FROM ps
WHERE st_distance(m.groundtruth,ps.p)< d

ORDER BY anchor_id,groundtruth,random();

感谢Erwin Brandstetter!

解决方案

为了消除重复,这可能是PostgreSQL中效率最高的查询:

  SELECT DISTINCT ON (anchor_id,groundtruth)* 
FROM measurement
WHERE st_distance(p,groundtruth)< d

关于这种查询风格的更多信息:



正如评论中提到的那样,这给你一个任意选择。如果您需要随机,则稍微贵一些:

  SELECT DISTINCT ON(anchor_id,groundtruth)* 
FROM measurement
WHERE st_distance(p,groundtruth)< d
ORDER BY anchor_id,groundtruth,random()






第二部分难以优化。 ** EXISTS * * 半连接可能是最快的选择。对于给定的表 ps(p点)

  SELECT DISTINCT ON (anchor_id,groundtruth)* 
FROM measurement m
WHERE EXISTS(
SELECT 1
FROM ps
WHERE st_distance(ps.p,m.groundtruth)< d

ORDER BY anchor_id,groundtruth,random();

只要一个 p 足够接近,它使查询的其余部分变得简单。



请务必用一个匹配的 GiST索引

。 b

如果您有一个数组作为输入,请创建一个 CTE ()

  WITH ps AS SELECT unnest(p_array)AS p)
SELECT ...



根据评论更新



如果您只需要单行作为答案,则可以简化:

  WITH ps AS(SELECT unnest(p_arra y)AS p)
SELECT *
FROM measurement m
WHERE EXISTS(
SELECT 1
FROM ps
WHERE st_distance(ps.p,m。 groundtruth) d

极限1;



加快 ST_DWithin()



用函数 ST_DWithin() (和一个匹配的GiST索引!)。
获取一个在这里选择而不是CTE):

  SELECT * 
FROM measurement m
JOIN(SELECT unnest( p_array)AS p)ps ON ST_DWithin(ps.p,m.groundtruth,d)
LIMIT 1;

要获得所有点的一行 p d :

ON(ps.p)*
FROM measurement m
JOIN(SELECT unnest(p_array)AS p)ps ON ST_DWithin(ps.p,m.groundtruth,d)

添加 ORDER BY random()会使这个查询更加昂贵的。如果没有 random(),Postgres可以从GiST索引中选择第一个匹配的行。






顺便说一下,<$ c 所有可能的匹配都必须随机检索和排序。 $ c> LIMIT 1 EXISTS 中是毫无意义的。在我提供的链接上阅读手册。这个相关的问题


I want to write a query using Postgres and PostGIS. I'm also using Rails with rgeo, rgeo-activerecord and activerecord-postgis-adapter, but the Rails stuff is rather unimportant.

The table structure:

measurement
 - int id
 - int anchor_id
 - Point groundtruth
 - data (not important for the query)

Example data:

id | anchor_id | groundtruth | data
-----------------------------------
1  | 1         | POINT(1 4)  | ...
2  | 3         | POINT(1 4)  | ...
3  | 2         | POINT(1 4)  | ...
4  | 3         | POINT(1 4)  | ...
-----------------------------------
5  | 2         | POINT(3 2)  | ...
6  | 4         | POINT(3 2)  | ...
-----------------------------------
7  | 1         | POINT(4 3)  | ...
8  | 1         | POINT(4 3)  | ...
9  | 1         | POINT(4 3)  | ...
10 | 5         | POINT(4 3)  | ...
11 | 3         | POINT(4 3)  | ...

This table is some kind of manually created view for faster lookups (with millions of rows). Else we'd have to join 8 tables and it would get even slower. But that's not part of the problem.


Simple version:

Parameters:

  • Point p
  • int d

What the query should do:

1. The query looks for all groundtruth Points which have a distance < d from Point p

SQL for that is pretty easy: WHERE st_distance(groundtruth, p) < d

2. Now we have a list of groundtruth points with their anchor_ids. As you can see in the table above, it is possible to have multiple identical groundtruth-anchor_id tuples. For example: anchor_id=3 and groundtruth=POINT(1 4).

3. Next I'd like to eliminate the identical tuples, by choosing one of them randomly(!). Why not simply take the first? Because the data column is different.

Choosing a random row in SQL: SELECT ... ORDER BY RANDOM() LIMIT 1

My problem with all of this is: I can imagine a solution using SQL LOOPs and lot's of subqueries, but there's for sure a solution using GROUP BY or some other methods which will make it faster.

Full version:

Basically the same as above with one difference: The input parameters change:

  • lot's of Points p1 ... p312456345
  • still one d

If the simple query is working, this could be done using a LOOP in SQL. But maybe there is a better (and faster) solution, because the database is really huge!


Solution

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT DISTINCT ON (anchor_id, groundtruth)
    *
FROM measurement m, ps
WHERE EXISTS (
    SELECT 1
    FROM ps
    WHERE st_distance(m.groundtruth, ps.p) < d
)
ORDER BY anchor_id, groundtruth, random();

Thanks to Erwin Brandstetter!

解决方案

To eliminate duplicates, this might be the most efficient query in PostgreSQL:

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement
WHERE  st_distance(p, groundtruth) < d

More about this query style:

As mentioned in the comments this gives you an arbitrary pick. If you need random, somewhat more expensive:

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement
WHERE  st_distance(p, groundtruth) < d
ORDER  BY anchor_id, groundtruth, random()


The second part is harder to optimize. **EXISTS** semi-join will probably be the fastest choice. For a given table ps (p point):

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement m
WHERE  EXISTS (
   SELECT 1
   FROM   ps
   WHERE  st_distance(ps.p, m.groundtruth) < d
   )
ORDER  BY anchor_id, groundtruth, random();

This can stop evaluating as soon as one p is close enough and it keeps the rest of the query simple.

Be sure to back that up with a matching GiST index.

If you have an array as input, create a CTE with unnest() on the fly:

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT ...

Update according to comment

If you only need a single row as answer, you can simplify:

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT *
FROM   measurement m
WHERE  EXISTS (
   SELECT 1
   FROM   ps
   WHERE  st_distance(ps.p, m.groundtruth) < d
   )
LIMIT  1;

Faster with ST_DWithin()

Probably more efficient with the function ST_DWithin() (and a matching GiST index!).
To get one row (using a sub-select instead of a CTE here):

SELECT *
FROM   measurement m
JOIN  (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)
LIMIT  1;

To get one row for every point p within distance d:

SELECT DISTINCT ON (ps.p) *
FROM   measurement m
JOIN  (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)

Adding ORDER BY random() will make this query more expensive. Without random(), Postgres can just pick the first matching row from the GiST index. Else all possible matches have to be retrieved and ordered randomly.


BTW, LIMIT 1 inside EXISTS is pointless. Read the manual at the link I provided or this related question.

这篇关于分组后按值(非列)从组中选择随机条目?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆