Loading a large dictionary using python pickle


Question

I have a full inverted index in the form of a nested Python dictionary. Its structure is:

{word : { doc_name : [location_list] } }

For example, let the dictionary be called index; then for the word "spam", the entry would look like:

{ 'spam' : { 'doc1.txt' : [102,300,399], 'doc5.txt' : [200,587] } }

I used this structure because Python dicts are pretty optimised and it makes programming easier.

For any word 'spam', the documents containing it can be given by:

index['spam'].keys()

and the posting list for a document doc1 by:

index['spam']['doc1']

At present I am using cPickle to store and load this dictionary. But the pickled file is around 380 MB and takes a long time to load - 112 seconds (approx.; I timed it using time.time()) - and memory usage goes to 1.2 GB (Gnome system monitor). Once it loads, it's fine. I have 4 GB RAM.

len(index.keys()) gives 229758

import cPickle as pickle

f = open('full_index','rb')
print 'Loading index... please wait...'
index = pickle.load(f)  # This takes ages
print 'Index loaded. You may now proceed to search'

How can I make it load faster? I only need to load it once, when the application starts. After that, access time is important for responding to queries.

Should I switch to a database like SQLite and create an index on its keys? If yes, how do I store the values to have an equivalent schema, which makes retrieval easy? Is there anything else that I should look into?
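
For reference, one possible equivalent schema is a single postings table keyed on (word, doc_name), with each location list serialised in its row. This is only a minimal sketch, not from the original post; the database file name, table name, and the choice of JSON for the location list are illustrative assumptions:

import sqlite3
import json

conn = sqlite3.connect('inverted_index.db')
conn.execute('''CREATE TABLE IF NOT EXISTS postings (
                    word      TEXT,
                    doc_name  TEXT,
                    locations TEXT,   -- location list serialised as JSON
                    PRIMARY KEY (word, doc_name)
                )''')

def store(index):
    # index[word][doc_name] = [locations...]  ->  one row per (word, doc) pair
    rows = ((w, d, json.dumps(locs))
            for w, docs in index.items()
            for d, locs in docs.items())
    conn.executemany('INSERT OR REPLACE INTO postings VALUES (?, ?, ?)', rows)
    conn.commit()

def docs_for(word):
    # equivalent of index['spam'].keys(): documents containing the word
    cur = conn.execute('SELECT doc_name FROM postings WHERE word = ?', (word,))
    return [row[0] for row in cur]

def postings_for(word, doc_name):
    # equivalent of index['spam']['doc1']: posting list for one document
    cur = conn.execute('SELECT locations FROM postings WHERE word = ? AND doc_name = ?',
                       (word, doc_name))
    row = cur.fetchone()
    return json.loads(row[0]) if row else []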

Using Tim's answer, pickle.dump(index, file, -1), the pickled file is considerably smaller - around 237 MB (it took 300 seconds to dump) - and now takes half the time to load (61 seconds, as opposed to 112 s earlier, timed with time.time()).

But should I migrate to a database for scalability?

For now I am marking Tim's answer as accepted.

PS: I don't want to use Lucene or Xapian... This question refers to Storing an inverted index. I had to ask a new question because I wasn't able to delete the previous one.

Answer

Try the protocol argument when using cPickle.dump/cPickle.dumps. From cPickle.Pickler.__doc__:

Pickler(file, protocol=0) -- Create a pickler.

This takes a file-like object for writing a pickle data stream. The optional proto argument tells the pickler to use the given protocol; supported protocols are 0, 1, 2. The default protocol is 0, to be backwards compatible. (Protocol 0 is the only protocol that can be written to a file opened in text mode and read back successfully. When using a protocol higher than 0, make sure the file is opened in binary mode, both when pickling and unpickling.)

Protocol 1 is more efficient than protocol 0; protocol 2 is more efficient than protocol 1.

Specifying a negative protocol version selects the highest protocol version supported. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.

The file parameter must have a write() method that accepts a single string argument. It can thus be an open file object, a StringIO object, or any other custom object that meets this interface.
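
Concretely, that means passing -1 (or 2) as the third argument to dump and opening the file in binary mode for both writing and reading. A minimal sketch, reusing the example entry from the question (the file name matches the one in the question's code; nothing else here comes from the answer itself):

import cPickle as pickle

index = { 'spam' : { 'doc1.txt' : [102,300,399], 'doc5.txt' : [200,587] } }

# A negative protocol selects the highest protocol available; protocols
# above 0 are binary, so the file must be opened in binary mode.
with open('full_index', 'wb') as f:
    pickle.dump(index, f, -1)

# Load it back, again in binary mode.
with open('full_index', 'rb') as f:
    index = pickle.load(f)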

Converting to JSON or YAML will probably take longer than pickling, most of the time - pickle stores native Python types.
