有效的查询合并了两个以上的子查询 [英] Effective query merging more than 2 subqueries

查看:95
本文介绍了有效的查询合并了两个以上的子查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据库,其中

books          (primary key: bookID)
characterNames (foreign key: books.bookID) 
locations      (foreign key: books.bookID)

文本位置

我正在使用psycopg2编写Python脚本,查找书中给定字符名称和位置的所有出现情况。我只希望在书中出现字符名称和位置的地方。

在这里我已经找到了一种搜索一个位置和一个字符的解决方案:

The in-text-position of character names and locations are saved in the corresponding tables.
I'm writing a Pythonscript using psycopg2, finding all occurences of given character names and locations in books. I only want the occurences in books, where both the character name AND the location are found.
Here I already got a solution for searching one location and one character:

WITH b AS (  
    SELECT bookid  
    FROM   characternames  
    WHERE  name = 'XXX'  
    GROUP  BY 1  
    INTERSECT  
    SELECT bookid  
    FROM   locations  
    WHERE  l.locname = 'YYY'  
    GROUP  BY 1  
    )  
SELECT bookid, position, 'char' AS what  
FROM   b  
JOIN   characternames USING (bookid)  
WHERE  name = 'XXX'  
UNION  ALL  
SELECT bookid, position, 'loc' AS what  
FROM   b  
JOIN   locations USING (bookid)  
WHERE  locname = 'YYY'  
ORDER  BY bookid, position;  

CTE'b'包含所有书号,其中字符名称'XXX'和位置'YYY '出现。

The CTE 'b' contains all bookid's, where the character name 'XXX' and the location 'YYY' appear.

现在我还想知道要搜索2个位置和一个名称(或分别搜索2个名称和一个位置)。如果所有搜索到的实体都必须在一本书中出现,这很简单,但是这是怎么回事:

搜索:Tim,Al,工具店
结果:包括

的书籍(蒂姆, Al,工具店)或

(蒂姆,铝)或

(蒂姆,工具店)或

(Al,工具店)

Now I'm additionally wondering about searching for 2 places and a name (or 2 names and a place respectively). It's simple if all searched entities must occur in one book, but what about this:
Searching for: Tim, Al, Toolshop Results: books including
(Tim, Al, Toolshop) or
(Tim, Al) or
(Tim, Toolshop) or
(Al, Toolshop)

在4、5、6 ...条件下可能会重复出现该问题。

我想与更多子查询进行相交,但这是行不通的。
相反,我会将找到的bookID进行UNION,将它们分组,然后选择发生更多的bookID:

The problem could be repeated for 4, 5, 6...conditions.
I thougt about INTERSECTing more subqueries, but that wouldn't work.
Instead I would UNION the found bookIDs, GROUP them and select bookid's occurring more then once:

WITH b AS (  
    SELECT bookid, count(bookid) AS occurrences  
    FROM  
        (SELECT DISTINCT bookid  
        FROM characterNames  
        WHERE name='XXX'  
        UNION  
        SELECT DISTINCT bookid  
        FROM characterNames  
        WHERE name='YYY'  
        UNION  
        SELECT DISTINCT bookid  
        FROM locations  
        WHERE locname='ZZZ'  
        GROUP BY bookid)  
    WHERE occurrences>1)  

我认为这可行,不能现在测试一下,但这是最好的方法吗?

I think this works, can't test it at the moment, but is it the best way to do this?

推荐答案

将计数用于广义的想法案子是有声音的。但是,对语法进行了一些调整:

The idea to use a count for the generalized case is sound. A couple of adjustments to the syntax, though:

WITH b AS (  
   SELECT bookid
   FROM  (
      SELECT DISTINCT bookid  
      FROM   characterNames  
      WHERE  name='XXX'  

      UNION ALL  
      SELECT DISTINCT bookid  
      FROM   characterNames  
      WHERE  name='YYY'  

      UNION ALL
      SELECT DISTINCT bookid  
      FROM   locations  
      WHERE  locname='ZZZ'  
      ) x
   GROUP  BY bookid
   HAVING count(*) > 1
   )
SELECT bookid, position, 'char' AS what
FROM   b
JOIN   characternames USING (bookid)
WHERE  name = 'XXX'

UNION  ALL
SELECT bookid, position, 'loc' AS what
FROM   b
JOIN   locations USING (bookid)
WHERE  locname = 'YYY'
ORDER  BY bookid, position;



注释




  • 使用 UNION ALL (而不是 UNION )保留子查询之间的重复项。在这种情况下,您希望它们能够对其进行计数。

    Notes

    • Use UNION ALL (not UNION) to preserve duplicates between the subqueries. You want them in this case to be able to count them.

      子查询应该产生不同的值。它与 DISTINCT 一起使用。您可能想尝试来代替它,看看效果是否更好(我不希望这样。)

      The subqueries are supposed to produces distinct values. It works with DISTINCT the way you have it. You may want to try GROUP BY 1 instead and see if that performs better (I don't expect it to.)

      GROUP BY 不必超出子查询范围。它将仅应用于最后一个子查询,并且在那里没有任何意义,因为您已经有 DISTINCT bookid

      The GROUP BY hast to go outside the subquery. It would only be applied to the last subquery and makes no sense there as you have DISTINCT bookid already.

      检查一本书是否有多个匹配项必须进入 HAVING 子句:

      The check whether there are more than one hits on a book has to go into a HAVING clause:

       HAVING count(*) > 1
      

      您不能在 WHERE 子句。

      您不能简单地在一张桌子上组合多个条件。您将如何计算发现的数量?但是有一种更复杂的方法。可能会或可能不会改善性能,您必须进行测试(使用 EXPLAIN ANALYZE )。这两个查询都需要对表 characterNames 进行至少两次索引扫描。至少会缩短语法。

      You cannot simply combine multiple conditions on one table. How will you count the number of findings? But there is a somewhat more sophisticated way. May or may not improve performance, You'll have to test (with EXPLAIN ANALYZE). Both queries require at least two index scans for the table characterNames. At least it shortens the syntax.

      考虑如何计算 characterNames 的匹配数以及如何更改为总和(命中)在外部 SELECT

      Consider how I compute the number of hits for characterNames and how I changed to sum(hits) in the outer SELECT:

      WITH b AS (  
         SELECT bookid
         FROM  (
            SELECT bookid
                 , max((name='XXX')::int)
                 + max((name='YYY')::int) AS hits
            FROM   characterNames  
            WHERE  (name='XXX'
                 OR name='YYY')
            GROUP  BY bookid
      
            UNION ALL
            SELECT DISTINCT bookid, 1 AS hits  
            FROM   locations  
            WHERE  locname='ZZZ'  
            ) x
         GROUP  BY bookid
         HAVING sum(hits) > 1
         )
      ...
      

      布尔值转换为整数会得到 0 表示 FALSE 1 表示 TRUE 。会有所帮助。

      Converting a boolean to integer gives 0 for FALSE and 1 for TRUE. That helps.

      骑自行车到我公司,这东西一直在我的脑后踢。我有理由相信此查询可能会更快。请尝试一下:

      While riding my bike to my company this thing kept kicking at the back of my head. I have reason to believe this query might be even faster. Please give it a try:

      WITH b AS (  
         SELECT bookid
      
              , (EXISTS (
                  SELECT *
                  FROM   characterNames c
                  WHERE  c.bookid = b.bookid
                  AND    c.name = 'XXX'))::int
              + (EXISTS (
                  SELECT *
                  FROM   characterNames c
                  WHERE  c.bookid = b.bookid
                  AND    c.name = 'YYY'))::int AS c_hits
      
              , (EXISTS (
                  SELECT *
                  FROM   locations l
                  WHERE  l.bookid = b.bookid
                  AND    l.locname='ZZZ'))::int AS l_hits
         FROM   books b  
         WHERE  (c_hits + l_hits) > 1
         )
      SELECT c.bookid, c.position, 'char' AS what
      FROM   b
      JOIN   characternames c USING (bookid)
      WHERE  b.c_hits > 0
      AND    c.name IN ('XXX', 'YYY')
      
      UNION  ALL
      SELECT l.bookid, l.position, 'loc' AS what
      FROM   b
      JOIN   locations l USING (bookid)
      WHERE  b.l_hits > 0
      AND    l.locname = 'YYY'
      ORDER  BY 1,2,3;
      




      • 存在半联接可以在第一个匹配项处停止执行。由于我们只对CTE中的全有或全无答案感兴趣,因此可以更快地完成工作。

        • The EXISTS semi-join can stop executing at the first match. As we are only interested in an all-or-nothing answer in the CTE, this could possibly do the job much faster.

          这样,我们也不需要进行汇总(没有必要)。

          This way we also don't need to aggregate (no GROUP BY necessary).

          I还记得是否找到了任何个字符或位置,并且仅重新访问具有实际匹配项的表。

          I also remember whether any characters or locations were found and only revisit tables with actual matches.

          这篇关于有效的查询合并了两个以上的子查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆