有效的查询合并了两个以上的子查询 [英] Effective query merging more than 2 subqueries
问题描述
我有一个数据库,其中
books (primary key: bookID)
characterNames (foreign key: books.bookID)
locations (foreign key: books.bookID)
文本位置
我正在使用psycopg2编写Python脚本,查找书中给定字符名称和位置的所有出现情况。我只希望在书中出现字符名称和位置的地方。
在这里我已经找到了一种搜索一个位置和一个字符的解决方案:
The in-text-position of character names and locations are saved in the corresponding tables.
I'm writing a Pythonscript using psycopg2, finding all occurences of given character names and locations in books. I only want the occurences in books, where both the character name AND the location are found.
Here I already got a solution for searching one location and one character:
WITH b AS (
SELECT bookid
FROM characternames
WHERE name = 'XXX'
GROUP BY 1
INTERSECT
SELECT bookid
FROM locations
WHERE l.locname = 'YYY'
GROUP BY 1
)
SELECT bookid, position, 'char' AS what
FROM b
JOIN characternames USING (bookid)
WHERE name = 'XXX'
UNION ALL
SELECT bookid, position, 'loc' AS what
FROM b
JOIN locations USING (bookid)
WHERE locname = 'YYY'
ORDER BY bookid, position;
CTE'b'包含所有书号,其中字符名称'XXX'和位置'YYY '出现。
The CTE 'b' contains all bookid's, where the character name 'XXX' and the location 'YYY' appear.
现在我还想知道要搜索2个位置和一个名称(或分别搜索2个名称和一个位置)。如果所有搜索到的实体都必须在一本书中出现,这很简单,但是这是怎么回事:
搜索:Tim,Al,工具店
结果:包括
的书籍(蒂姆, Al,工具店)或
(蒂姆,铝)或
(蒂姆,工具店)或
(Al,工具店)
Now I'm additionally wondering about searching for 2 places and a name (or 2 names and a place respectively). It's simple if all searched entities must occur in one book, but what about this:
Searching for: Tim, Al, Toolshop
Results: books including
(Tim, Al, Toolshop) or
(Tim, Al) or
(Tim, Toolshop) or
(Al, Toolshop)
在4、5、6 ...条件下可能会重复出现该问题。
我想与更多子查询进行相交,但这是行不通的。
相反,我会将找到的bookID进行UNION,将它们分组,然后选择发生更多的bookID:
The problem could be repeated for 4, 5, 6...conditions.
I thougt about INTERSECTing more subqueries, but that wouldn't work.
Instead I would UNION the found bookIDs, GROUP them and select bookid's occurring more then once:
WITH b AS (
SELECT bookid, count(bookid) AS occurrences
FROM
(SELECT DISTINCT bookid
FROM characterNames
WHERE name='XXX'
UNION
SELECT DISTINCT bookid
FROM characterNames
WHERE name='YYY'
UNION
SELECT DISTINCT bookid
FROM locations
WHERE locname='ZZZ'
GROUP BY bookid)
WHERE occurrences>1)
我认为这可行,不能现在测试一下,但这是最好的方法吗?
I think this works, can't test it at the moment, but is it the best way to do this?
推荐答案
将计数用于广义的想法案子是有声音的。但是,对语法进行了一些调整:
The idea to use a count for the generalized case is sound. A couple of adjustments to the syntax, though:
WITH b AS (
SELECT bookid
FROM (
SELECT DISTINCT bookid
FROM characterNames
WHERE name='XXX'
UNION ALL
SELECT DISTINCT bookid
FROM characterNames
WHERE name='YYY'
UNION ALL
SELECT DISTINCT bookid
FROM locations
WHERE locname='ZZZ'
) x
GROUP BY bookid
HAVING count(*) > 1
)
SELECT bookid, position, 'char' AS what
FROM b
JOIN characternames USING (bookid)
WHERE name = 'XXX'
UNION ALL
SELECT bookid, position, 'loc' AS what
FROM b
JOIN locations USING (bookid)
WHERE locname = 'YYY'
ORDER BY bookid, position;
注释
-
使用
UNION ALL
(而不是UNION
)保留子查询之间的重复项。在这种情况下,您希望它们能够对其进行计数。Notes
Use
UNION ALL
(notUNION
) to preserve duplicates between the subqueries. You want them in this case to be able to count them.子查询应该产生不同的值。它与
DISTINCT
一起使用。您可能想尝试来代替它,看看效果是否更好(我不希望这样。) The subqueries are supposed to produces distinct values. It works with
DISTINCT
the way you have it. You may want to tryGROUP BY 1
instead and see if that performs better (I don't expect it to.)GROUP BY
不必超出子查询范围。它将仅应用于最后一个子查询,并且在那里没有任何意义,因为您已经有DISTINCT bookid
。The
GROUP BY
hast to go outside the subquery. It would only be applied to the last subquery and makes no sense there as you haveDISTINCT bookid
already.检查一本书是否有多个匹配项必须进入
HAVING
子句:The check whether there are more than one hits on a book has to go into a
HAVING
clause:HAVING count(*) > 1
您不能在
WHERE $ c $中使用汇总值c>子句。
您不能简单地在一张桌子上组合多个条件。您将如何计算发现的数量?但是有一种更复杂的方法。可能会或可能不会改善性能,您必须进行测试(使用
EXPLAIN ANALYZE
)。这两个查询都需要对表characterNames
进行至少两次索引扫描。至少会缩短语法。You cannot simply combine multiple conditions on one table. How will you count the number of findings? But there is a somewhat more sophisticated way. May or may not improve performance, You'll have to test (with
EXPLAIN ANALYZE
). Both queries require at least two index scans for the tablecharacterNames
. At least it shortens the syntax.考虑如何计算
characterNames
的匹配数以及如何更改为总和(命中)
在外部SELECT
:Consider how I compute the number of hits for
characterNames
and how I changed tosum(hits)
in the outerSELECT
:WITH b AS ( SELECT bookid FROM ( SELECT bookid , max((name='XXX')::int) + max((name='YYY')::int) AS hits FROM characterNames WHERE (name='XXX' OR name='YYY') GROUP BY bookid UNION ALL SELECT DISTINCT bookid, 1 AS hits FROM locations WHERE locname='ZZZ' ) x GROUP BY bookid HAVING sum(hits) > 1 ) ...
将
布尔值
转换为整数
会得到0
表示FALSE
,1
表示TRUE
。会有所帮助。Converting a
boolean
tointeger
gives0
forFALSE
and1
forTRUE
. That helps.骑自行车到我公司,这东西一直在我的脑后踢。我有理由相信此查询可能会更快。请尝试一下:
While riding my bike to my company this thing kept kicking at the back of my head. I have reason to believe this query might be even faster. Please give it a try:
WITH b AS ( SELECT bookid , (EXISTS ( SELECT * FROM characterNames c WHERE c.bookid = b.bookid AND c.name = 'XXX'))::int + (EXISTS ( SELECT * FROM characterNames c WHERE c.bookid = b.bookid AND c.name = 'YYY'))::int AS c_hits , (EXISTS ( SELECT * FROM locations l WHERE l.bookid = b.bookid AND l.locname='ZZZ'))::int AS l_hits FROM books b WHERE (c_hits + l_hits) > 1 ) SELECT c.bookid, c.position, 'char' AS what FROM b JOIN characternames c USING (bookid) WHERE b.c_hits > 0 AND c.name IN ('XXX', 'YYY') UNION ALL SELECT l.bookid, l.position, 'loc' AS what FROM b JOIN locations l USING (bookid) WHERE b.l_hits > 0 AND l.locname = 'YYY' ORDER BY 1,2,3;
-
存在半联接可以在第一个匹配项处停止执行。由于我们只对CTE中的全有或全无答案感兴趣,因此可以更快地完成工作。
The
EXISTS
semi-join can stop executing at the first match. As we are only interested in an all-or-nothing answer in the CTE, this could possibly do the job much faster.这样,我们也不需要进行汇总(没有
必要)。 This way we also don't need to aggregate (no
GROUP BY
necessary).I还记得是否找到了任何个字符或位置,并且仅重新访问具有实际匹配项的表。
I also remember whether any characters or locations were found and only revisit tables with actual matches.
这篇关于有效的查询合并了两个以上的子查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
-