Storing an inverted index

Question

I am working on an information retrieval project. I have built a full inverted index using Hadoop and Python. Hadoop outputs the index as (word, document list) pairs, which are written to a file. For quick access, I have created a dictionary (hash table) from that file. My question is: how do I store such an index on disk so that it still has a quick access time? At present I am storing the dictionary using the Python pickle module and loading from it, but it brings the whole index into memory at once (or does it?). Please suggest an efficient way of storing and searching through the index.

My dictionary structure is as follows (using nested dictionaries):

{word: {doc1: [locations], doc2: [locations], ...}}

so that I can get the documents containing a word with dictionary[word].keys(), and so on.
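To make the structure concrete, here is a minimal sketch of such a nested dictionary built in plain Python. The build_index helper, the whitespace tokenization, and the two sample documents are assumptions for illustration only; they are not the Hadoop job described above.

```python
from collections import defaultdict


def build_index(docs):
    """Build {word: {doc_id: [positions]}} from {doc_id: text}.

    Hypothetical helper: tokenizes on whitespace and lowercases,
    which is an assumption, not the original pipeline.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            # Record the token position of this word in this document.
            index[word].setdefault(doc_id, []).append(pos)
    return dict(index)


# Made-up sample documents.
docs = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog and the fox",
}
index = build_index(docs)

# Documents containing a word, as in dictionary[word].keys()
print(sorted(index["fox"].keys()))
# Positions of a word inside one document
print(index["the"]["doc2"])
```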

Recommended answer

shelve

At present I am storing the dictionary using the Python pickle module and loading from it, but it brings the whole index into memory at once (or does it?).

Yes, it does bring it all in.
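A small sketch of why that is so: pickle.load deserializes the whole object stored in the file and returns it, so there is no way to fetch a single word without rebuilding the full dictionary. The sample index and the temporary file path below are made up for illustration.

```python
import os
import pickle
import tempfile

# Made-up sample index in the {word: {doc: [locations]}} shape.
index = {
    "fox": {"doc1": [3], "doc2": [5]},
    "dog": {"doc2": [2]},
}

# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "index.pkl")

with open(path, "wb") as f:
    pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    # This single call reads and rebuilds the entire dictionary in memory.
    loaded = pickle.load(f)

# Every word is now resident, even if only one lookup is needed.
print(len(loaded), sorted(loaded["fox"].keys()))
```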

Is that a problem? If it's not an actual problem, then stick with it.

If it's a problem, what kind of problem do you have? Too slow? Too fast? Too colorful? Too much memory used? What problem do you have?
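If the problem turns out to be memory use, Python's standard shelve module is one option: it behaves like a dictionary backed by a dbm file on disk, and each value is unpickled only when its key is looked up. The following is a rough sketch under that assumption; the sample index and the file name are hypothetical, and whether lookups are fast enough depends on the dbm backend available on your system.

```python
import os
import shelve
import tempfile

# Made-up sample index in the {word: {doc: [locations]}} shape.
index = {
    "fox": {"doc1": [3], "doc2": [5]},
    "dog": {"doc2": [2]},
}

# Hypothetical shelf location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "index_shelf")

# Write one entry per word; shelve keys must be strings.
with shelve.open(path) as db:
    for word, postings in index.items():
        db[word] = postings

# Reopen read-only and fetch a single word's postings.
with shelve.open(path, flag="r") as db:
    # Only the value stored under "fox" is unpickled here.
    postings = db.get("fox", {})
    docs_with_fox = sorted(postings.keys())

print(docs_with_fox)
```

One design note: storing one shelf entry per word keeps each lookup small, but a very common word still carries its whole postings dictionary, so very long postings lists may need further splitting.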
