搜索引擎中的有效低基数AND [英] Efficient low-cardinality ANDs in a search engine
问题描述
Lucene等搜索引擎如何在数据集中的术语对许多文档通用的情况下执行AND查询?例如,在倒排索引中:
How do search engines such as Lucene, etc. perform AND queries where a term is common to many documents in the dataset? For example, in an inverted index of:
term | document_id
---------------------
program | 1, 2, 3, 5...
python | 1, 4
code | 4
c++ | 4, 5
术语 program
存在于多个文档中,这意味着对 program AND code
的查询将需要对大量文档进行相交.
the term program
is present in several documents meaning a query of program AND code
would require performing an intersection upon a very large set of documents.
是否有一种方法可以执行AND查询,而不必采用可能包含数十亿个文档的术语的交集?
Is there a way to perform AND queries without having to take the intersection of terms contained by potentially billions of documents?
推荐答案
程序"一词存在于多个文档中,这意味着对程序和代码的查询将需要对非常大的一组文档执行相交.
the term program is present in several documents meaning a query of program AND code would require performing an intersection upon a very large set of documents.
是的.假设您有以下查询:
Yes. Say you have the following query:
term1
AND
term2AND
term3
term1
AND
term2AND
term3
您首先需要计算每个正项的文档频率.您选择计数最低的单词.
You first need to compute the document frequency of each positive term. You pick the word with the lowest count.
您从查询中检索包含最不常用术语的文档.那些是候选人.然后,使用有限状态机对查询进行筛选并为其打分.
You retrieve the documents that contains the least common term from the query. Those are the candidates. Then you filter and score those candidates with the query with a finite state machine.
所以数据库有几个子空间:
So the database has several subspace:
- 从引理或词干或术语到文档频率的映射(如tfidf)
- 允许检索包含给定引理的文档的实际倒排索引
- 取决于查询逻辑的先进程度,文档ID与文档的全文表示或一小袋单词之间的映射.
然后,过滤器+得分步骤可以并行发生
Then the filter + score step can happen in parallel
这篇关于搜索引擎中的有效低基数AND的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!