How to build a simple inverted index?


Problem Description


I want to build a simple indexing function for a search engine without using any API such as Lucene. In the inverted index, I just need to record basic information about each word, e.g. docID, position, and frequency.

Now, I have several questions:

  1. What kind of data structure is often used for building an inverted index? A multidimensional list?

  2. After building the index, how do I write it to files? What format should the file have? Like a table? Like drawing an index table on paper?

Solution

You can see a very simple implementation of an inverted index and search in TinySearchEngine.

For your first question, if you want to build a simple (in-memory) inverted index, the straightforward data structure is a hash map like this:

val invertedIndex = new collection.mutable.HashMap[String, List[Posting]]

or a Java-esque version:

HashMap<String, List<Posting>> invertedIndex = new HashMap<String, List<Posting>>();

The hash maps each term/word/token to a list of Postings. A Posting is just an object that represents an occurrence of a word inside a document:

case class Posting(docId: Int, var termFrequency: Int)

Indexing a new document is just a matter of tokenizing it (splitting it into tokens/words) and, for each token, inserting a new Posting into the correct list of the hash map. Of course, if a Posting already exists for that term in that specific docId, you increase its termFrequency instead. There are other ways of doing this. For in-memory inverted indexes this is OK, but for on-disk indexes you'd probably want to insert each Posting once with the correct termFrequency instead of updating it every time.
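Here is a minimal sketch of that indexing step in Scala, matching the structures above. The split-on-non-word-characters tokenizer is an assumption for illustration; the answer doesn't prescribe one:

    import scala.collection.mutable

    case class Posting(docId: Int, var termFrequency: Int)

    val invertedIndex = mutable.HashMap.empty[String, List[Posting]]

    // Index one document: tokenize it, then insert or update one Posting per term.
    def index(docId: Int, text: String): Unit = {
      val tokens = text.toLowerCase.split("\\W+").filter(_.nonEmpty) // naive tokenizer (assumption)
      for (token <- tokens) {
        val postings = invertedIndex.getOrElse(token, Nil)
        postings.find(_.docId == docId) match {
          case Some(p) => p.termFrequency += 1 // term already seen in this doc
          case None    => invertedIndex(token) = Posting(docId, 1) :: postings
        }
      }
    }

Note that prepending keeps each list unordered; a real index would usually keep postings sorted by docId to make merging and lookups cheaper.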

Regarding your second question, there are normally two cases:

(1) You have an (almost) immutable index. You index all your data once, and if new data arrives you can simply reindex. There is no need for real-time indexing or for reindexing many times an hour, for example.

(2) New documents arrive all the time, and you need to be able to search the newly arrived documents as soon as possible.

For case (1), you can have at least 2 files:

1 - The inverted index file. For each term, it lists all Postings (docId/termFrequency pairs). It is represented here in plain text, but normally stored as binary data.

 Term1<docId1,termFreq><docId2,termFreq><docId3,termFreq><docId4,termFreq><docId5,termFreq><docId6,termFreq><docId7,termFreq>
 Term2<docId3,termFreq><docId5,termFreq><docId9,termFreq><docId10,termFreq><docId11,termFreq>
 Term3<docId1,termFreq><docId3,termFreq><docId10,termFreq>
 Term4<docId5,termFreq><docId7,termFreq><docId10,termFreq><docId12,termFreq>
 ...
 TermN<docId5,termFreq><docId7,termFreq>

2 - The offset file. It stores, for each term, the offset at which its inverted list can be found in the inverted index file. Here I'm representing the offset in characters, but you'll normally store binary data, so the offset will be in bytes. This file can be loaded into memory at startup time. When you need to look up a term's inverted list, you look up its offset and read the inverted list from the file.

Term1 -> 0
Term2 -> 126
Term3 -> 222
....
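To make the two-file layout concrete, here is a sketch that writes postings in the plain-text form shown above while recording each term's starting byte offset, and then uses an offset to seek and read an inverted list back. The function names are illustrative, and a real index would use a binary encoding:

    import java.io.{FileOutputStream, RandomAccessFile}

    // Write one line per term and remember where each line starts (in bytes).
    def writeIndex(index: Map[String, List[Posting]], path: String): Map[String, Long] = {
      val out = new FileOutputStream(path)
      val offsets = Map.newBuilder[String, Long]
      var offset = 0L
      for ((term, postings) <- index.toSeq.sortBy(_._1)) {
        offsets += term -> offset
        val line = term + postings.map(p => s"<${p.docId},${p.termFrequency}>").mkString + "\n"
        val bytes = line.getBytes("UTF-8")
        out.write(bytes)
        offset += bytes.length
      }
      out.close()
      offsets.result() // this map is what the offset file stores
    }

    // Lookup: seek to the term's offset and read its inverted list line.
    def readInvertedList(path: String, offset: Long): String = {
      val raf = new RandomAccessFile(path, "r")
      try { raf.seek(offset); raf.readLine() } finally { raf.close() }
    }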

Along with these 2 files, you can (and generally will) have file(s) to store each term's IDF and each document's norm.
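The answer doesn't fix an IDF formula; a common choice (an assumption here) is idf(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t:

    // One common IDF variant (an assumption; other formulations exist):
    def idf(numDocs: Int, docFreq: Int): Double =
      math.log(numDocs.toDouble / docFreq)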

For case (2), I'll try to briefly explain how Lucene (and consequently Solr and ElasticSearch) do it.

The file format can be the same as explained above. The main difference is that, when you index new documents, systems like Lucene don't rebuild the index from scratch; they just create a new index containing only the new documents. So every time you have to index something, you do it in a new, separate index.

To perform a query over this "split" index, you can run the query against each individual index (in parallel) and merge the results together before returning them to the user.

Lucene calls these "little" indexes segments.
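A sketch of that fan-out-and-merge step, assuming a hypothetical per-segment search function and a simple score-based merge (neither is Lucene's actual API):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    case class Hit(docId: Int, score: Double)

    // Run the query against every segment in parallel, then merge into one ranked list.
    def search[Seg](segments: Seq[Seg], query: String)
                   (searchSegment: (Seg, String) => Seq[Hit]): Seq[Hit] = {
      val perSegment = segments.map(s => Future(searchSegment(s, query)))
      val merged = Await.result(Future.sequence(perSegment), 10.seconds).flatten
      merged.sortBy(-_.score) // global ranking across all segments
    }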

The obvious concern here is that you'll get a lot of little segments very quickly. To avoid this, you'll need a policy for merging segments into larger ones. For example, if you have more than N segments, you can decide to merge together all segments smaller than 10 KB.
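A toy version of such a merge policy, with the thresholds from the example above (both N = 10 and the 10 KB cutoff are illustrative assumptions):

    case class SegmentInfo(id: Int, sizeBytes: Long)

    // Once there are more than maxSegments segments, pick every segment
    // smaller than minSizeBytes to be merged into one bigger segment.
    def segmentsToMerge(segments: Seq[SegmentInfo],
                        maxSegments: Int = 10,
                        minSizeBytes: Long = 10 * 1024): Seq[SegmentInfo] =
      if (segments.size > maxSegments) segments.filter(_.sizeBytes < minSizeBytes)
      else Seq.empty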
