Effective query with multiple conditions
Question
I have a database with the following tables:
books (primary key: bookID)
characterNames (foreign key: books.bookID)
locations (foreign key: books.bookID)
The in-text positions of character names and locations are saved in the corresponding tables.
Now I want to write a Python script using psycopg2 to find all occurrences of a given character name and a given location in books where both occur.
At the moment, I execute 4 queries:
SELECT bookID, position FROM characterNames WHERE name='XXX';
--> result is saved in list 'charnames'
SELECT DISTINCT bookID FROM characterNames WHERE name='XXX';
--> result is saved in list 'charnamesIDs'
SELECT bookID, position FROM locations WHERE locName='YYY';
--> result is saved in list 'locs'
SELECT bookID FROM locations WHERE locName='YYY';
--> result is saved in list 'locsIDs'
Both queries could give me bookIDs where just the name OR the location appears. So my goal is to eliminate all elements of 'charnames' with bookIDs not occurring in 'locs', and the other way round. My approach was:
for cnameTuple in charnames:
    if cnameTuple[0] in locsIDs:
        continue
    del(cname)
I made a corresponding loop for the tuples in locs.
Unfortunately, this algorithm needs a lot of time. Is there a way to perform this task more quickly?
Recommended answer
This could be much faster and simpler with a query using JOINs.
Something like this:
SELECT b.*, c.position, l.position
FROM books b
JOIN characternames c USING (bookid)
JOIN locations l USING (bookid)
WHERE c.name = 'XXX'
AND l.locname = 'YYY';
It is generally a good idea to retrieve only the data you need from the database server instead of fetching excessive data and filtering in your app. Database engines are optimized for filtering and sorting. You can hardly ever match that in your app to begin with.
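As a minimal sketch of running the JOIN from Python and letting the database do the filtering: the example below uses the standard-library sqlite3 module as a stand-in so that it runs without a server. The sample rows, the title column, and the function name find_cooccurrences are assumptions made for illustration. With psycopg2 the placeholders would be %s instead of ?.

```python
import sqlite3

# Stand-in database so the sketch is self-contained (assumption: with
# psycopg2 you would connect to PostgreSQL and use %s placeholders).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (bookid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE characternames (bookid INTEGER, name TEXT, position INTEGER);
    CREATE TABLE locations (bookid INTEGER, locname TEXT, position INTEGER);
    INSERT INTO books VALUES (1, 'Book A'), (2, 'Book B'), (3, 'Book C');
    INSERT INTO characternames VALUES
        (1, 'XXX', 10), (1, 'XXX', 50), (2, 'XXX', 7), (3, 'ZZZ', 3);
    INSERT INTO locations VALUES
        (1, 'YYY', 20), (3, 'YYY', 5), (2, 'QQQ', 9);
""")


def find_cooccurrences(conn, name, locname):
    """Return (bookid, name position, location position) for books with both."""
    sql = """
        SELECT b.bookid, c.position, l.position
        FROM   books b
        JOIN   characternames c USING (bookid)
        JOIN   locations l USING (bookid)
        WHERE  c.name = ?
        AND    l.locname = ?
        ORDER  BY b.bookid, c.position, l.position
    """
    # Parameters are passed separately so the driver handles quoting.
    return conn.execute(sql, (name, locname)).fetchall()


rows = find_cooccurrences(conn, "XXX", "YYY")
for row in rows:
    print(row)
```

Note that this form returns one row per pair of name position and location position within a book, so a book with many hits on both sides produces many rows.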
Note how I use all lower case names. Read the manual about identifiers.
"Thousands of books" are no problem at all for an RDBMS like PostgreSQL, which is designed to handle millions. The key to performance with large tables is proper indexing. For the queries here, the following indexes will potentially help:
CREATE INDEX books_bookid_idx ON books (bookid); -- a primary key will do, too
CREATE INDEX cn_bookid_idx ON characternames (bookid);
CREATE INDEX cn_name_idx ON characternames (name);
CREATE INDEX locations_bookid_idx ON locations (bookid);
CREATE INDEX locations_locname_idx ON locations (locname);
Multicolumn indexes may perform even better. Test with EXPLAIN ANALYZE; it will show you which indexes get used and how fast the query is. Creating indexes is very fast, and experimenting with them is easy. Just don't keep indexes you don't need. They carry a maintenance cost, too.
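A rough sketch of the "create an index, then inspect the plan" loop described above. It runs against sqlite3 as a stand-in, so it uses SQLite's EXPLAIN QUERY PLAN rather than PostgreSQL's EXPLAIN ANALYZE, and the plan text differs between the two systems. The index name cn_name_bookid_idx and the helper plan_lines are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characternames (bookid INTEGER, name TEXT, position INTEGER);
    -- Hypothetical multicolumn index: filter column first, join column second.
    CREATE INDEX cn_name_bookid_idx ON characternames (name, bookid);
""")


def plan_lines(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of every plan row.
    return [row[-1] for row in rows]


plan = plan_lines(
    conn, "SELECT bookid FROM characternames WHERE name = ?", ("XXX",)
)
# True when the planner chose the multicolumn index for this lookup.
uses_index = any("cn_name_bookid_idx" in line for line in plan)
print(plan, uses_index)
```

In PostgreSQL the equivalent check is to run the query prefixed with EXPLAIN ANALYZE and read which index scans appear in the output.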
I think I understand now what you are looking for. This query should be optimized to get all positions of locations or names per bookid, but only where name and location show up in the same book, and no further details per book:
WITH b AS (
   SELECT bookid FROM characternames WHERE name = 'XXX' GROUP BY 1
   INTERSECT
   SELECT bookid FROM locations WHERE locname = 'YYY' GROUP BY 1
   )
SELECT bookid, position, 'char' AS what
FROM   b
JOIN   characternames USING (bookid)
WHERE  name = 'XXX'
UNION  ALL
SELECT bookid, position, 'loc' AS what
FROM   b
JOIN   locations USING (bookid)
WHERE  locname = 'YYY'
ORDER  BY bookid, position;
Major points
- The CTE (WITH query) makes sure the base query is only executed once.
- INTERSECT picks only bookids that feature both location and name.
- The UNION ALL in the final SELECT returns all found positions. Use UNION instead if you want to trim duplicates with the same position.
- I order by bookid, position, guessing that is what's needed.
- Added a column what to tag the source (location or name) of a position.
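To illustrate the points above, here is a small sketch that runs the CTE query from Python. It uses sqlite3 as a stand-in with made-up sample rows; the function name positions_in_shared_books is hypothetical. The GROUP BY 1 clauses are left out in this sketch because INTERSECT already returns distinct rows. With psycopg2 the named placeholders would be written %(name)s and %(locname)s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characternames (bookid INTEGER, name TEXT, position INTEGER);
    CREATE TABLE locations (bookid INTEGER, locname TEXT, position INTEGER);
    INSERT INTO characternames VALUES
        (1, 'XXX', 10), (1, 'XXX', 50), (2, 'XXX', 7), (3, 'ZZZ', 3);
    INSERT INTO locations VALUES
        (1, 'YYY', 20), (3, 'YYY', 5), (2, 'QQQ', 9);
""")


def positions_in_shared_books(conn, name, locname):
    """All positions of name and location, only for books containing both."""
    sql = """
        WITH b AS (
            SELECT bookid FROM characternames WHERE name = :name
            INTERSECT
            SELECT bookid FROM locations WHERE locname = :locname
        )
        SELECT bookid, position, 'char' AS what
        FROM   b JOIN characternames USING (bookid)
        WHERE  name = :name
        UNION ALL
        SELECT bookid, position, 'loc' AS what
        FROM   b JOIN locations USING (bookid)
        WHERE  locname = :locname
        ORDER  BY bookid, position
    """
    return conn.execute(sql, {"name": name, "locname": locname}).fetchall()


hits = positions_in_shared_books(conn, "XXX", "YYY")
for bookid, position, what in hits:
    print(bookid, position, what)
```

Only book 1 contains both terms in the sample data, so books 2 and 3 drop out without any filtering loop in the application.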
If search terms appear many times per book, you could considerably speed up the search by creating auxiliary tables with distinct entries for (bookid, term). Create a multicolumn primary key on the two columns and an additional index on just term. Create one such table for locations and another one for names. Keep them up to date with triggers if need be, but I assume the content of books is not changing much. This would simplify and speed up the CTE.
If that still isn't fast enough, look into Full Text Search.
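The auxiliary-table idea could look roughly like the sketch below. It again uses sqlite3 as a stand-in, and the table names name_terms and loc_terms, the index names, and the sample rows are all hypothetical. The tables are filled once with SELECT DISTINCT; keeping them current with triggers is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characternames (bookid INTEGER, name TEXT, position INTEGER);
    CREATE TABLE locations (bookid INTEGER, locname TEXT, position INTEGER);
    INSERT INTO characternames VALUES
        (1, 'XXX', 10), (1, 'XXX', 50), (2, 'XXX', 7), (4, 'XXX', 1), (4, 'XXX', 2);
    INSERT INTO locations VALUES
        (1, 'YYY', 20), (4, 'YYY', 8), (4, 'YYY', 9), (3, 'YYY', 5);

    -- Hypothetical auxiliary tables: one distinct row per (bookid, term).
    CREATE TABLE name_terms (bookid INTEGER, term TEXT, PRIMARY KEY (bookid, term));
    CREATE TABLE loc_terms  (bookid INTEGER, term TEXT, PRIMARY KEY (bookid, term));
    CREATE INDEX name_terms_term_idx ON name_terms (term);
    CREATE INDEX loc_terms_term_idx  ON loc_terms (term);

    INSERT INTO name_terms SELECT DISTINCT bookid, name FROM characternames;
    INSERT INTO loc_terms  SELECT DISTINCT bookid, locname FROM locations;
""")


def shared_books(conn, name, locname):
    """Return bookids that contain both terms, using only the small tables."""
    sql = """
        SELECT n.bookid
        FROM   name_terms n
        JOIN   loc_terms l USING (bookid)
        WHERE  n.term = ? AND l.term = ?
        ORDER  BY n.bookid
    """
    return [row[0] for row in conn.execute(sql, (name, locname))]


book_ids = shared_books(conn, "XXX", "YYY")
name_term_count = conn.execute("SELECT count(*) FROM name_terms").fetchone()[0]
print(book_ids, name_term_count)
```

The lookup of qualifying books then touches one row per book and term instead of one row per occurrence, which is where the speed-up would come from when terms repeat often.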