SQL query to find a row with a specific number of associations


Problem description

Using Postgres I have a schema that has conversations and conversationUsers. Each conversation has many conversationUsers. I want to be able to find the conversation that has the exactly specified number of conversationUsers. In other words, provided an array of userIds (say, [1, 4, 6]) I want to be able to find the conversation that contains only those users, and no more.

So far I've tried this:

SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."userId" IN (1, 4)
GROUP BY c."conversationId"
HAVING COUNT(c."userId") = 2;

Unfortunately, this also seems to return conversations which include these 2 users among others. (For example, it returns a result if the conversation also includes "userId" 5).

Solution

This is a case of relational division, with the added special requirement that the same conversation shall have no additional users.

Assuming ("userId", "conversationId") is the PK of table "conversationUsers", which enforces uniqueness of combinations and NOT NULL, and also implicitly provides the index essential for performance. The columns of the multicolumn PK must be in this order! Else you have to do more.
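A minimal sketch of the table this answer assumes. The integer column types and the column order of the PK are assumptions, as the question does not show its DDL. The second index is likewise only a suggestion for the lookups by "conversationId" used in the queries that follow.

```sql
-- Assumed table definition (types and PK column order are assumptions)
CREATE TABLE "conversationUsers" (
   "userId"         int NOT NULL
 , "conversationId" int NOT NULL
 , PRIMARY KEY ("userId", "conversationId")   -- "userId" first: supports lookups by user
);

-- Optional, hypothetical addition: supports lookups by conversation,
-- which the PK index with "userId" as leading column serves poorly
CREATE INDEX ON "conversationUsers" ("conversationId", "userId");
```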

For the basic query, there is the "brute force" approach to count the number of matching users for all conversations of all given users and then filter the ones matching all given users. OK for small tables and/or only short input arrays and/or few conversations per user, but doesn't scale well:

SELECT "conversationId"
FROM   "conversationUsers" c
WHERE  "userId" = ANY ('{1,4,6}'::int[])
GROUP  BY 1
HAVING count(*) = array_length('{1,4,6}'::int[], 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = c."conversationId"
   AND    "userId" <> ALL('{1,4,6}'::int[])
   );

The NOT EXISTS anti-semi-join eliminates conversations with additional users.
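To see the effect of both conditions, here is a small worked example. The temporary table and its sample rows are made up for illustration and are not from the question.

```sql
-- Hypothetical sample data in a temporary table
CREATE TEMP TABLE "conversationUsers" (
   "userId"         int NOT NULL
 , "conversationId" int NOT NULL
 , PRIMARY KEY ("userId", "conversationId")
);

INSERT INTO "conversationUsers" ("conversationId", "userId") VALUES
   (10, 1), (10, 4), (10, 6)            -- exactly the given users
 , (20, 1), (20, 4), (20, 6), (20, 7)   -- all given users plus user 7
 , (30, 1), (30, 4);                    -- user 6 is missing

SELECT "conversationId"
FROM   "conversationUsers" c
WHERE  "userId" = ANY ('{1,4,6}'::int[])
GROUP  BY 1
HAVING count(*) = array_length('{1,4,6}'::int[], 1)   -- rejects conversation 30
AND    NOT EXISTS (                                    -- rejects conversation 20
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = c."conversationId"
   AND    "userId" <> ALL('{1,4,6}'::int[])
   );
```

With this data only conversation 10 qualifies. Without the NOT EXISTS condition, conversation 20 would be returned as well.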

There are various other, (much) faster query techniques, but the fastest ones are not well suited for a dynamic number of user IDs.

For a fast query that can also deal with a dynamic number of user IDs, consider a recursive CTE:

WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = ('{1,4,6}'::int[])[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte                r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = ('{1,4,6}'::int[])[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length(('{1,4,6}'::int[]), 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL('{1,4,6}'::int[])
   );

For ease of use wrap this in a function or prepared statement. Like:

PREPARE conversations(int[]) AS
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = $1[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte                r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = $1[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length($1, 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL($1)
   );

Call:

EXECUTE conversations('{1,4,6}');
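The function variant could look like the following sketch. The function name f_conversations and the parameter name _user_ids are hypothetical, and the integer return type assumes "conversationId" is an integer column. Like the prepared statement, it assumes the input array holds distinct user IDs.

```sql
-- Hypothetical wrapper function around the recursive CTE
CREATE OR REPLACE FUNCTION f_conversations(_user_ids int[])
  RETURNS SETOF int
  LANGUAGE sql STABLE AS
$func$
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = _user_ids[1]          -- start with the first user

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte                r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = _user_ids[r.idx + 1]  -- step to the next user
   )
SELECT r."conversationId"
FROM   rcte r
WHERE  r.idx = array_length(_user_ids, 1)  -- all given users found
AND    NOT EXISTS (                        -- and no additional users
   SELECT FROM "conversationUsers" cu
   WHERE  cu."conversationId" = r."conversationId"
   AND    cu."userId" <> ALL (_user_ids)
   );
$func$;

-- Call:
SELECT * FROM f_conversations('{1,4,6}');
```

Unlike a prepared statement, which only lives for the duration of the session, the function is stored in the database and can be called from any session.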

There is still room for improvement: put users with the fewest conversations first in your input array to eliminate as many rows as possible early. To get top performance, you can dynamically generate a non-dynamic, non-recursive query (using one of the faster techniques mentioned above) and execute that in turn. You could even wrap it in a single plpgsql function with dynamic SQL.
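One possible way to order the input array is sketched below. It counts conversations per given user and aggregates the IDs in ascending order of that count. The column aliases ct and ordered_user_ids are made up for illustration.

```sql
-- Reorder the given user IDs so that users with the fewest conversations come first
SELECT array_agg(u."userId" ORDER BY u.ct, u."userId") AS ordered_user_ids
FROM  (
   SELECT i."userId"
        , (SELECT count(*)
           FROM   "conversationUsers" c
           WHERE  c."userId" = i."userId") AS ct   -- conversations per user
   FROM   unnest('{1,4,6}'::int[]) AS i("userId")
   ) u;
```

The resulting array can then be passed as the argument of the query. Counting adds work of its own, so this is probably only worthwhile when the given users differ a lot in their number of conversations.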

Alternative: MV for sparsely written table

If the table "conversationUsers" is mostly read-only (old conversations are unlikely to change) you might use a MATERIALIZED VIEW with pre-aggregated users in sorted arrays and create a plain btree index on that array column.

CREATE MATERIALIZED VIEW mv_conversation_users AS
SELECT "conversationId", array_agg("userId") AS users  -- sorted array
FROM (
   SELECT "conversationId", "userId"
   FROM   "conversationUsers"
   ORDER  BY 1, 2
   ) sub
GROUP  BY 1
ORDER  BY 1;

CREATE INDEX ON mv_conversation_users (users) INCLUDE ("conversationId");

The demonstrated covering index requires Postgres 11 or later.

In older versions use a plain multicolumn index on (users, "conversationId"). With very long arrays, a hash index might make sense in Postgres 10 or later.
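As a sketch, the two index variants for the materialized view mv_conversation_users could be created like this. Which one pays off depends on array length and Postgres version, so treat both as options to test rather than recommendations.

```sql
-- Before Postgres 11: plain multicolumn btree index instead of INCLUDE
CREATE INDEX ON mv_conversation_users (users, "conversationId");

-- Possible alternative for very long arrays (Postgres 10 or later): hash index.
-- It stores only a hash of the array, not the array itself,
-- and cannot return "conversationId" from the index alone.
CREATE INDEX ON mv_conversation_users USING hash (users);
```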

Then the much faster query would simply be:

SELECT "conversationId"
FROM   mv_conversation_users c
WHERE  users = '{1,4,6}'::int[];  -- sorted array!
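The equality check only matches if the input array is sorted the same way as the stored arrays. If the caller cannot guarantee that, the input can be sorted on the fly, as in this sketch using plain unnest() and an ARRAY constructor:

```sql
-- Sort an unsorted input array before comparing it to the stored, sorted arrays
SELECT "conversationId"
FROM   mv_conversation_users
WHERE  users = ARRAY(
   SELECT u
   FROM   unnest('{6,1,4}'::int[]) u   -- unsorted input
   ORDER  BY u
   );
```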

You have to weigh added costs to storage, writes and maintenance against benefits to read performance.
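Part of the maintenance cost is that a materialized view does not update itself and has to be refreshed after changes to the underlying table. A minimal sketch, assuming each conversation appears only once in the view (which the GROUP BY in its definition ensures):

```sql
-- A unique index is required for REFRESH ... CONCURRENTLY
CREATE UNIQUE INDEX ON mv_conversation_users ("conversationId");

-- Rebuild the view without blocking concurrent reads
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_conversation_users;
```

A plain REFRESH MATERIALIZED VIEW without CONCURRENTLY needs no unique index, but blocks reads on the view while it runs.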

Aside: consider legal, lower-case identifiers that do not require double quotes: conversation_id instead of "conversationId" etc.
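If renaming is an option, it could be done along these lines. The new names are only suggestions, and any stored query text, function or application code that references the old names would have to be adapted as well.

```sql
-- Rename to lower-case identifiers that need no double quotes
ALTER TABLE "conversationUsers" RENAME TO conversation_users;
ALTER TABLE conversation_users RENAME COLUMN "conversationId" TO conversation_id;
ALTER TABLE conversation_users RENAME COLUMN "userId" TO user_id;

-- Queries can then be written without quoting:
SELECT conversation_id
FROM   conversation_users
WHERE  user_id = 1;
```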
