Loading a large dictionary using python pickle


Question

I have a full inverted index in the form of a nested Python dictionary. Its structure is:

{word : { doc_name : [location_list] } }

For example, let the dictionary be called index; then for the word "spam", the entry would look like:

{ 'spam' : { 'doc1.txt' : [102,300,399], 'doc5.txt' : [200,587] } }

I used this structure because Python dicts are pretty optimised and it makes programming easier.

For any word 'spam', the documents containing it can be given by:

index['spam'].keys()

and the posting list for a document doc1 by:

index['spam']['doc1']

At present I am using cPickle to store and load this dictionary. But the pickled file is around 380 MB and takes a long time to load - 112 seconds (approx.; I timed it using time.time()) - and memory usage goes to 1.2 GB (Gnome system monitor). Once it loads, it's fine. I have 4 GB RAM.

len(index.keys()) gives 229758

import cPickle as pickle

f = open('full_index','rb')
print 'Loading index... please wait...'
index = pickle.load(f)  # This takes ages
print 'Index loaded. You may now proceed to search'

How can I make it load faster? I only need to load it once, when the application starts. After that, access time is important for responding to queries.

Should I switch to a database like SQLite and create an index on its keys? If yes, how do I store the values to have an equivalent schema, which makes retrieval easy? Is there anything else that I should look into?
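
For reference, one possible equivalent schema is a single postings table keyed on (word, doc_name), with each location list serialised in its row. This is only a minimal sketch, not from the original post; the database file name, table name, and the choice of JSON for the location list are illustrative assumptions:

import sqlite3
import json

conn = sqlite3.connect('inverted_index.db')
conn.execute('''CREATE TABLE IF NOT EXISTS postings (
                    word      TEXT,
                    doc_name  TEXT,
                    locations TEXT,   -- location list serialised as JSON
                    PRIMARY KEY (word, doc_name)
                )''')

def store(index):
    # index[word][doc_name] = [locations...]  ->  one row per (word, doc) pair
    rows = ((w, d, json.dumps(locs))
            for w, docs in index.items()
            for d, locs in docs.items())
    conn.executemany('INSERT OR REPLACE INTO postings VALUES (?, ?, ?)', rows)
    conn.commit()

def docs_for(word):
    # equivalent of index['spam'].keys(): documents containing the word
    cur = conn.execute('SELECT doc_name FROM postings WHERE word = ?', (word,))
    return [row[0] for row in cur]

def postings_for(word, doc_name):
    # equivalent of index['spam']['doc1']: posting list for one document
    cur = conn.execute('SELECT locations FROM postings WHERE word = ? AND doc_name = ?',
                       (word, doc_name))
    row = cur.fetchone()
    return json.loads(row[0]) if row else []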

Using Tim's answer, pickle.dump(index, file, -1), the pickled file is considerably smaller - around 237 MB (it took 300 seconds to dump) - and now takes half the time to load (61 seconds, as opposed to 112 s earlier, timed with time.time()).

But should I migrate to a database for scalability?

For now I am marking Tim's answer as accepted.

PS: I don't want to use Lucene or Xapian... This question refers to Storing an inverted index. I had to ask a new question because I wasn't able to delete the previous one.

Answer

Try the protocol argument when using cPickle.dump/cPickle.dumps. From cPickle.Pickler.__doc__:

Pickler(file, protocol=0) -- Create a pickler.

This takes a file-like object for writing a pickle data stream. The optional proto argument tells the pickler to use the given protocol; supported protocols are 0, 1, 2. The default protocol is 0, to be backwards compatible. (Protocol 0 is the only protocol that can be written to a file opened in text mode and read back successfully. When using a protocol higher than 0, make sure the file is opened in binary mode, both when pickling and unpickling.)

Protocol 1 is more efficient than protocol 0; protocol 2 is more efficient than protocol 1.

Specifying a negative protocol version selects the highest protocol version supported. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.

The file parameter must have a write() method that accepts a single string argument. It can thus be an open file object, a StringIO object, or any other custom object that meets this interface.
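
Concretely, that means passing -1 (or 2) as the third argument to dump and opening the file in binary mode for both writing and reading. A minimal sketch, reusing the example entry from the question (the file name matches the one in the question's code; nothing else here comes from the answer itself):

import cPickle as pickle

index = { 'spam' : { 'doc1.txt' : [102,300,399], 'doc5.txt' : [200,587] } }

# A negative protocol selects the highest protocol available; protocols
# above 0 are binary, so the file must be opened in binary mode.
with open('full_index', 'wb') as f:
    pickle.dump(index, f, -1)

# Load it back, again in binary mode.
with open('full_index', 'rb') as f:
    index = pickle.load(f)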

Converting to JSON or YAML will probably take longer than pickling, most of the time - pickle stores native Python types.
