Using cPickle to serialize a large dictionary causes MemoryError

Problem description

I'm writing an inverted index for a search engine on a collection of documents. Right now, I'm storing the index as a dictionary of dictionaries. That is, each keyword maps to a dictionary of docIDs->positions of occurrence.

The data model looks something like: {word : { doc_name : [location_list] } }

Building the index in memory works fine, but when I try to serialize to disk, I hit a MemoryError. Here's my code:

import sys
import cPickle

# Write the index (built earlier in the program) out to disk
with open(sys.argv[3], 'wb') as serializedIndex:
    cPickle.dump(index, serializedIndex, cPickle.HIGHEST_PROTOCOL)

Right before serialization, my program is using about 50% of memory (1.6 GB). As soon as I make the call to cPickle, my memory usage skyrockets to 80% and then the program crashes.

Why is cPickle using so much memory for serialization? Is there a better way to approach this problem?

Recommended answer

cPickle needs a lot of extra memory because it does cycle detection: it keeps a memo of every object it has already serialized. If you are sure your data has no cycles, you could try the marshal module instead.
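A minimal sketch of the marshal approach, assuming the index contains only built-in types (dicts, lists, strings, integers). The sample `index` and the temporary file path are made up for illustration. Note that the marshal format is specific to the Python version that wrote it, so this suits a cache you can rebuild rather than long-term storage.

```python
import marshal
import os
import tempfile

# Small stand-in for the real index: {word: {doc_name: [positions]}}
index = {
    "cat": {"a.txt": [2], "b.txt": [4]},
    "the": {"a.txt": [0], "b.txt": [0, 3]},
}

# Hypothetical output location; the original script takes it from sys.argv[3]
path = os.path.join(tempfile.mkdtemp(), "index.marshal")

# Serialize to disk with marshal instead of cPickle
with open(path, "wb") as f:
    marshal.dump(index, f)

# Read the index back to confirm the round trip
with open(path, "rb") as f:
    restored = marshal.load(f)
```

If the index ever holds instances of custom classes, marshal will refuse to serialize them, so this only applies while the structure stays plain.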
