搜索引擎中的有效低基数AND [英] Efficient low-cardinality ANDs in a search engine

查看:71
本文介绍了搜索引擎中的有效低基数AND的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Lucene等搜索引擎如何在数据集中的术语对许多文档通用的情况下执行AND查询?例如,在倒排索引中:

How do search engines such as Lucene, etc. perform AND queries where a term is common to many documents in the dataset? For example, in an inverted index of:

term    | document_id
---------------------
program | 1, 2, 3, 5...
python  | 1, 4
code    | 4
c++     | 4, 5

术语 program 存在于多个文档中,这意味着对 program AND code 的查询将需要对大量文档进行相交.

the term program is present in several documents meaning a query of program AND code would require performing an intersection upon a very large set of documents.

是否有一种方法可以执行AND查询,而不必采用可能包含数十亿个文档的术语的交集?

Is there a way to perform AND queries without having to take the intersection of terms contained by potentially billions of documents?

推荐答案

程序"一词存在于多个文档中,这意味着对程序和代码的查询将需要对非常大的一组文档执行相交.

the term program is present in several documents meaning a query of program AND code would require performing an intersection upon a very large set of documents.

是的.假设您有以下查询:

Yes. Say you have the following query:

term1 AND term2 AND term3

term1 AND term2 AND term3

您首先需要计算每个正项文档频率.您选择计数最低的单词.

You first need to compute the document frequency of each positive term. You pick the word with the lowest count.

您从查询中检索包含最不常用术语的文档.那些是候选人.然后,使用有限状态机对查询进行筛选并为其打分.

You retrieve the documents that contains the least common term from the query. Those are the candidates. Then you filter and score those candidates with the query with a finite state machine.

所以数据库有几个子空间:

So the database has several subspace:

  1. 从引理或词干或术语到文档频率的映射(如tfidf)
  2. 允许检索包含给定引理的文档的实际倒排索引
  3. 取决于查询逻辑的先进程度,文档ID与文档的全文表示或一小袋单词之间的映射.

然后,过滤器+得分步骤可以并行发生

Then the filter + score step can happen in parallel

这篇关于搜索引擎中的有效低基数AND的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆