How to search phrase queries in an inverted index structure?


Question

Suppose we want to run a phrase query such as "t1 t2 t3" (the terms t1, t2 and t3 must appear consecutively, in that order) against an inverted index. Which of the following approaches should we use?

1- First, look up the term "t1" and find all documents that contain it, then do the same for "t2" and for "t3". Then find the documents in which the positions of "t1", "t2" and "t3" are next to each other.

2- First, look up the term "t1" and find all documents that contain it; then search those documents for "t2"; then, within that result, find the documents that contain "t3".

I have a full inverted index. Which of the two approaches above is more efficient, (1) or (2)?

Thanks a lot.

Recommended answer

As the Wikipedia entry explains well:

There are two main variants of inverted indexes: A record level inverted index (or inverted file index or just inverted file) contains a list of references to documents for each word. A word level inverted index (or full inverted index or inverted list) additionally contains the positions of each word within a document. The latter form offers more functionality (like phrase searches), but needs more time and space to be created.
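To make the two variants concrete, here is a minimal Python sketch of both. The function names (`build_record_index`, `build_positional_index`), the lowercase whitespace tokenizer, and the sample documents are assumptions made for illustration; they do not come from any particular search library.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split; real systems tokenize more carefully.
    return text.lower().split()


def build_record_index(docs):
    """Record-level index: term -> set of ids of documents containing the term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def build_positional_index(docs):
    """Word-level index: term -> {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


# Hypothetical sample data.
docs = {1: "new york city", 2: "york is new to me"}
record = build_record_index(docs)
positional = build_positional_index(docs)
```

With this data, the record-level index only says that both documents contain "new" and "york"; the word-level index also records where, which is what a phrase query needs in order to tell "new york" apart from "york is new".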

Since you don't say which variant you have, we can't answer your question precisely, but thinking through each possibility will help.

Opening and searching documents is typically a costly operation unless your documents are unusually small, so you want to minimize it, and option (2) doesn't really do that. If you have an inverted list (a word-level index), option (1) lets you answer the query without opening any document at all, because the stored positions are enough to confirm adjacency. If you only have an inverted file (a record-level index), you will inevitably need to open documents and scan them, since the index alone lacks the information needed to confirm word adjacency; but at least with option (1) you minimize the number of documents you have to open and scan: only those in the intersection of the lists of documents containing each word.
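As a rough sketch of option (1) on a word-level index, the code below first intersects the document lists of all query terms and then checks positions for adjacency, never touching the documents themselves. The function name `phrase_search`, the index layout (term -> {doc_id: [positions]}), and the sample postings are assumptions made for illustration.

```python
def phrase_search(index, phrase):
    """Return ids of documents in which the phrase terms occur consecutively.

    `index` is assumed to map term -> {doc_id: [positions]}.
    """
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()

    # Step 1: intersect the document lists, starting from the shortest one.
    postings = [index[t] for t in terms]
    candidates = set(min(postings, key=len))
    for p in postings:
        candidates.intersection_update(p)  # iterating a dict yields its doc ids

    # Step 2: within each candidate, look for a start position of the first
    # term such that term k sits at start + k for every later term.
    hits = set()
    for doc_id in candidates:
        later = [set(index[t][doc_id]) for t in terms[1:]]
        for start in index[terms[0]][doc_id]:
            if all(start + k + 1 in positions for k, positions in enumerate(later)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical postings for the query terms from the question.
index = {
    "t1": {1: [0, 5], 2: [3]},
    "t2": {1: [1], 2: [7], 3: [0]},
    "t3": {1: [2], 2: [8]},
}
result = phrase_search(index, "t1 t2 t3")
```

Document 2 contains all three terms but not next to each other, so only the position check rules it out; document 3 is discarded earlier, at the intersection step, because it lacks "t1" and "t3".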

So, in either case, option (1) is more promising (unless your documents are peculiarly small).
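For the record-level case, a comparable sketch of option (1) intersects the document sets first and then scans only the surviving candidates. Here `docs` stands in for whatever document store you have, and the function name, the whitespace tokenizer, and the sample data are all hypothetical.

```python
def phrase_search_record_level(index, docs, phrase):
    """Phrase search when the index only maps term -> set of doc ids."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()

    # Only documents containing every term can possibly contain the phrase.
    candidates = set.intersection(*(index[t] for t in terms))

    hits = set()
    n = len(terms)
    for doc_id in candidates:
        # The costly part: "open" the document and scan it for the phrase.
        tokens = docs[doc_id].lower().split()
        if any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1)):
            hits.add(doc_id)
    return hits


# Hypothetical sample data.
docs = {1: "t1 t2 t3 x", 2: "t3 t2 t1", 3: "t1 t2", 4: "unrelated text"}
index = {"t1": {1, 2, 3}, "t2": {1, 2, 3}, "t3": {1, 2}, "unrelated": {4}, "text": {4}, "x": {1}}
result = phrase_search_record_level(index, docs, "t1 t2 t3")
```

In this example only documents 1 and 2 are opened and scanned; documents 3 and 4 are ruled out by the index alone.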
