Group by repeating attribute
Question
Basically, I have a table messages with a user_id field that identifies the user who created the message.
When I display a conversation (a set of messages) between two users, I want to be able to group the messages by user_id, but in a tricky way:
Let's say there are some messages (sorted by created_at desc):
id: 1, user_id: 1
id: 2, user_id: 1
id: 3, user_id: 2
id: 4, user_id: 2
id: 5, user_id: 1
I want to get 3 message groups in the following order:
[1,2], [3,4], [5]
It should group by user_id until it sees a different one, and then group by that one.
I'm using PostgreSQL and would be happy to use something specific to it, whatever would give the best performance.
Recommended answer
Proper SQL
@Igor presents a nice pure-SQL technique with window functions.
However:
"I want to get 3 message groups in the below order: [1,2], [3,4], [5]"
To get the requested order, add ORDER BY min(id):
SELECT array_agg(id) AS ids
FROM  (
   SELECT id
         ,user_id
         ,row_number() OVER (ORDER BY id) -
          row_number() OVER (PARTITION BY user_id ORDER BY id) AS grp
   FROM   messages
   ORDER  BY id) t   -- for ordered arrays in result
GROUP  BY grp, user_id
ORDER  BY min(id);
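To see why the difference of the two row numbers identifies each run, it can help to look at the intermediate values. The query below is only an illustrative sketch: it inlines the five sample rows from the question as a CTE (so no real messages table is needed), and the column names rn_all and rn_user are made up for this example.

```sql
-- Sample rows from the question, inlined as a CTE so the query runs standalone.
WITH messages(id, user_id) AS (
   VALUES (1, 1), (2, 1), (3, 2), (4, 2), (5, 1)
)
SELECT id
     , user_id
     , row_number() OVER (ORDER BY id)                      AS rn_all   -- position overall
     , row_number() OVER (PARTITION BY user_id ORDER BY id) AS rn_user  -- position per user
     , row_number() OVER (ORDER BY id)
     - row_number() OVER (PARTITION BY user_id ORDER BY id) AS grp      -- constant within a run
FROM   messages
ORDER  BY id;

--  id | user_id | rn_all | rn_user | grp
--   1 |       1 |      1 |       1 |   0
--   2 |       1 |      2 |       2 |   0
--   3 |       2 |      3 |       1 |   2
--   4 |       2 |      4 |       2 |   2
--   5 |       1 |      5 |       3 |   2
```

Note that ids 3 and 4 (user 2) and id 5 (user 1) end up with the same grp value. grp alone does not identify a run, which is why the query groups by grp and user_id together.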
The addition would barely warrant another answer. The more important issue is this:
"I'm using PostgreSQL and would be happy to use something specific to it, whatever would give the best performance."
Pure SQL is all nice and shiny, but a procedural server-side function is much faster for this task (roughly three times faster in the test below). While processing rows procedurally is generally slower, plpgsql wins this competition big-time, because it can make do with a single table scan and a single ORDER BY operation:
CREATE OR REPLACE FUNCTION f_msg_groups()
  RETURNS TABLE (ids int[]) AS
$func$
DECLARE
   _id    int;
   _uid   int;
   _id0   int;                         -- id of last row
   _uid0  int;                         -- user_id of last row
BEGIN
   FOR _id, _uid IN
      SELECT id, user_id FROM messages ORDER BY id
   LOOP
      IF _uid <> _uid0 THEN
         RETURN QUERY VALUES (ids);    -- output finished group (never fires on the first row: _uid0 is still NULL)
         ids := ARRAY[_id];            -- start new array
      ELSE
         ids := ids || _id;            -- add to array
      END IF;

      _id0  := _id;
      _uid0 := _uid;                   -- remember last row
   END LOOP;

   RETURN QUERY VALUES (ids);          -- output the last group
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_msg_groups();
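As a quick sanity check, the function can be run against the sample rows from the question. The sketch below assumes f_msg_groups() has already been created as shown above and uses a throwaway temporary table. The table definition is an assumption: the question only mentions id, user_id and created_at, so the real table certainly looks different.

```sql
-- Throwaway table holding the sample rows from the question.
-- Only id and user_id are modeled here; the column types are assumed.
-- A temporary table shadows a permanent table of the same name for this session.
CREATE TEMP TABLE messages (id int PRIMARY KEY, user_id int NOT NULL);

INSERT INTO messages (id, user_id)
VALUES (1, 1), (2, 1), (3, 2), (4, 2), (5, 1);

SELECT * FROM f_msg_groups();

--   ids
-- -------
--  {1,2}
--  {3,4}
--  {5}
```

Two edge cases are worth knowing about with the function as written: on an empty table it still returns one row (with ids being NULL), because the final RETURN QUERY runs unconditionally, and a NULL user_id never starts a new group, because the <> comparison then yields NULL rather than true.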
Benchmark and links
I ran a quick test with EXPLAIN ANALYZE on a similar real-life table with 60k rows (executed several times, picking the fastest result to exclude caching effects):
SQL:
Total runtime: 1009.549 ms
PL/pgSQL:
Total runtime: 336.971 ms
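The numbers above cover only the two variants shown. For completeness, here is a sketch of another common pure-SQL formulation of the same grouping, based on lag() and a running sum. It is not part of the original answer and has not been benchmarked here, so its relative speed is an open question. Both window functions share ORDER BY id, so it should get by with a single sort order. The sample rows are again inlined as a CTE, and prev_uid is a name invented for this example.

```sql
-- Sample rows from the question, inlined as a CTE so the query runs standalone.
WITH messages(id, user_id) AS (
   VALUES (1, 1), (2, 1), (3, 2), (4, 2), (5, 1)
)
SELECT array_agg(id ORDER BY id) AS ids
FROM  (
   SELECT id
          -- running count of "user changed" events = group number (1, 1, 2, 2, 3)
        , sum(CASE WHEN user_id IS DISTINCT FROM prev_uid THEN 1 ELSE 0 END)
             OVER (ORDER BY id) AS grp
   FROM  (
      SELECT id
           , user_id
           , lag(user_id) OVER (ORDER BY id) AS prev_uid  -- user_id of the previous row, NULL for the first
      FROM   messages
   ) s
) t
GROUP  BY grp
ORDER  BY grp;

--   ids
-- -------
--  {1,2}
--  {3,4}
--  {5}
```

Because grp increases with every change of user, it identifies a run on its own here, so grouping and ordering by grp is enough.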
Also consider these closely related questions:
- GROUP BY and aggregate sequential numeric values
- GROUP BY consecutive dates delimited by gaps
- Ordered count of consecutive repeats / duplicates