用于查找具有特定数量关联的行的 SQL 查询 [英] SQL query to find a row with a specific number of associations
问题描述
使用 Postgres,我有一个包含 conversations
和 conversationUsers
的架构.每个conversation
都有很多conversationUsers
.我希望能够找到具有确切指定数量的 conversationUsers
的对话.换句话说,如果提供了一个 userIds
数组(例如,[1, 4, 6]
),我希望能够找到仅包含这些用户的对话,并且不再.
到目前为止,我已经尝试过:
选择 c."conversationId"来自conversationUsers"cWHERE c."userId" IN (1, 4)GROUP BY c."conversationId"有 COUNT(c."userId") = 2;
不幸的是,这似乎也返回了包括这 2 个用户在内的对话.(例如,如果会话还包含 "userId"
5,则返回结果).
这是一个关系分部 - 添加了特殊要求,即同一个对话不应有其他用户.
假设是表 "conversationUsers"
的 PK,它强制组合的唯一性,NOT NULL
并且还隐式提供了对性能至关重要的索引.多列PK的列按this的顺序!否则你必须做更多.
关于索引列的顺序:
对于基本查询,有 "brute force" 方法来计算所有给定用户的 all 对话的匹配用户数量,然后过滤匹配的用户所有给定的用户.对于小型表格和/或只有较短的输入数组和/或每个用户的对话很少,但不能很好地扩展:
SELECT "conversationId"来自conversationUsers"cWHERE "userId" = ANY ('{1,4,6}'::int[])按 1 分组HAVING count(*) = array_length('{1,4,6}'::int[], 1)不存在(从对话用户"中选择WHERE "conversationId" = c."conversationId"AND用户ID"<>ALL('{1,4,6}'::int[]));
使用 NOT EXISTS
反半联接消除与其他用户的对话.更多:
替代技术:
还有其他各种(更快)关系划分查询技术.但最快的那些并不适合 动态 数量的用户 ID.
对于也可以处理动态数量的用户 ID 的快速查询,请考虑使用 递归 CTE:
带有递归 rcte AS (选择conversationId",1 作为 idx来自对话用户"WHERE "userId" = ('{1,4,6}'::int[])[1]联合所有选择 c."conversationId", r.idx + 1发件人加入conversationUsers" c 使用(conversationId")WHERE c."userId" = ('{1,4,6}'::int[])[idx + 1])选择对话 ID"发件人WHERE idx = array_length(('{1,4,6}'::int[]), 1)不存在(从对话用户"中选择WHERE "conversationId" = r."conversationId"AND用户ID"<>ALL('{1,4,6}'::int[]));
为了便于使用,将其包装在一个函数或 准备好的语句中.喜欢:
准备对话(int[]) AS使用递归 rcte AS (选择conversationId",1 作为 idx来自对话用户"WHERE 用户 ID" = $1[1]联合所有选择 c."conversationId", r.idx + 1发件人加入conversationUsers" c 使用(conversationId")其中 c."userId" = $1[idx + 1])选择对话 ID"发件人WHERE idx = array_length($1, 1)不存在(从对话用户"中选择WHERE "conversationId" = r."conversationId"AND用户ID"<>全部($1);
呼叫:
执行对话('{1,4,6}');
db<>小提琴这里(也演示一个函数)
仍有改进的余地:要获得最佳性能,您必须将会话最少的用户放在输入数组的首位,以便尽早消除尽可能多的行.要获得最佳性能,您可以动态生成非动态、非递归查询(使用第一个链接中的一种 fast 技术)并依次执行.您甚至可以使用动态 SQL 将其包装在单个 plpgsql 函数中......
更多解释:
替代方案:稀疏写表的 MV
如果表 "conversationUsers"
大部分是只读的(旧对话不太可能更改),您可以使用 MATERIALIZED VIEW
在排序数组中预先聚合用户,并在该数组列上创建一个普通的 btree 索引.
CREATE MATERIALIZED VIEW mv_conversation_users ASSELECT "conversationId", array_agg("userId") AS users -- 排序数组从 (选择对话 ID"、用户 ID"来自对话用户"按 1、2 订购) 子按 1 分组按 1 排序;在 mv_conversation_users (users) 上创建索引 INCLUDE ("conversationId");
演示的覆盖索引需要 Postgres 11.请参阅:
关于子查询中的行排序:
在旧版本中,在 (users, "conversationId")
上使用普通的多列索引.对于非常长的数组,哈希索引在 Postgres 10 或更高版本中可能有意义.
那么更快的查询就是:
SELECT "conversationId"FROM mv_conversation_users cWHERE users = '{1,4,6}'::int[];-- 排序数组!
db<>小提琴这里p>
您必须权衡增加的存储、写入和维护成本与读取性能的好处.
另外:考虑不带双引号的合法标识符.conversation_id
而不是 "conversationId"
等:
Using Postgres I have a schema that has conversations
and conversationUsers
. Each conversation
has many conversationUsers
. I want to be able to find the conversation that has the exactly specified number of conversationUsers
. In other words, provided an array of userIds
(say, [1, 4, 6]
) I want to be able to find the conversation that contains only those users, and no more.
So far I've tried this:
SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."userId" IN (1, 4)
GROUP BY c."conversationId"
HAVING COUNT(c."userId") = 2;
Unfortunately, this also seems to return conversations which include these 2 users among others. (For example, it returns a result if the conversation also includes "userId"
5).
This is a case of relational-division - with the added special requirement that the same conversation shall have no additional users.
Assuming is the PK of table "conversationUsers"
which enforces uniqueness of combinations, NOT NULL
and also provides the index essential for performance implicitly. Columns of the multicolumn PK in this order! Else you have to do more.
About the order of index columns:
For the basic query, there is the "brute force" approach to count the number of matching users for all conversations of all given users and then filter the ones matching all given users. OK for small tables and/or only short input arrays and/or few conversations per user, but doesn't scale well:
SELECT "conversationId"
FROM "conversationUsers" c
WHERE "userId" = ANY ('{1,4,6}'::int[])
GROUP BY 1
HAVING count(*) = array_length('{1,4,6}'::int[], 1)
AND NOT EXISTS (
SELECT FROM "conversationUsers"
WHERE "conversationId" = c."conversationId"
AND "userId" <> ALL('{1,4,6}'::int[])
);
Eliminating conversations with additional users with a NOT EXISTS
anti-semi-join. More:
Alternative techniques:
There are various other, (much) faster relational-division query techniques. But the fastest ones are not well suited for a dynamic number of user IDs.
For a fast query that can also deal with a dynamic number of user IDs, consider a recursive CTE:
WITH RECURSIVE rcte AS (
SELECT "conversationId", 1 AS idx
FROM "conversationUsers"
WHERE "userId" = ('{1,4,6}'::int[])[1]
UNION ALL
SELECT c."conversationId", r.idx + 1
FROM rcte r
JOIN "conversationUsers" c USING ("conversationId")
WHERE c."userId" = ('{1,4,6}'::int[])[idx + 1]
)
SELECT "conversationId"
FROM rcte r
WHERE idx = array_length(('{1,4,6}'::int[]), 1)
AND NOT EXISTS (
SELECT FROM "conversationUsers"
WHERE "conversationId" = r."conversationId"
AND "userId" <> ALL('{1,4,6}'::int[])
);
For ease of use wrap this in a function or prepared statement. Like:
PREPARE conversations(int[]) AS
WITH RECURSIVE rcte AS (
SELECT "conversationId", 1 AS idx
FROM "conversationUsers"
WHERE "userId" = $1[1]
UNION ALL
SELECT c."conversationId", r.idx + 1
FROM rcte r
JOIN "conversationUsers" c USING ("conversationId")
WHERE c."userId" = $1[idx + 1]
)
SELECT "conversationId"
FROM rcte r
WHERE idx = array_length($1, 1)
AND NOT EXISTS (
SELECT FROM "conversationUsers"
WHERE "conversationId" = r."conversationId"
AND "userId" <> ALL($1);
Call:
EXECUTE conversations('{1,4,6}');
db<>fiddle here (also demonstrating a function)
There is still room for improvement: to get top performance you have to put users with the fewest conversations first in your input array to eliminate as many rows as possible early. To get top performance you can generate a non-dynamic, non-recursive query dynamically (using one of the fast techniques from the first link) and execute that in turn. You could even wrap it in a single plpgsql function with dynamic SQL ...
More explanation:
Alternative: MV for sparsely written table
If the table "conversationUsers"
is mostly read-only (old conversations are unlikely to change) you might use a MATERIALIZED VIEW
with pre-aggregated users in sorted arrays and create a plain btree index on that array column.
CREATE MATERIALIZED VIEW mv_conversation_users AS
SELECT "conversationId", array_agg("userId") AS users -- sorted array
FROM (
SELECT "conversationId", "userId"
FROM "conversationUsers"
ORDER BY 1, 2
) sub
GROUP BY 1
ORDER BY 1;
CREATE INDEX ON mv_conversation_users (users) INCLUDE ("conversationId");
The demonstrated covering index requires Postgres 11. See:
About sorting rows in a subquery:
In older versions use a plain multicolumn index on (users, "conversationId")
. With very long arrays, a hash index might make sense in Postgres 10 or later.
Then the much faster query would simply be:
SELECT "conversationId"
FROM mv_conversation_users c
WHERE users = '{1,4,6}'::int[]; -- sorted array!
db<>fiddle here
You have to weigh added costs to storage, writes and maintenance against benefits to read performance.
Aside: consider legal identifiers without double quotes. conversation_id
instead of "conversationId"
etc.:
这篇关于用于查找具有特定数量关联的行的 SQL 查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!