Using cPickle to serialize a large dictionary causes MemoryError
Problem description
I'm writing an inverted index for a search engine on a collection of documents. Right now, I'm storing the index as a dictionary of dictionaries. That is, each keyword maps to a dictionary of docIDs->positions of occurrence.
The data model looks something like: {word : { doc_name : [location_list] } }
Building the index in memory works fine, but when I try to serialize to disk, I hit a MemoryError. Here's my code:
import sys
import cPickle

# Write the index out to disk; the with block closes the file afterwards
with open(sys.argv[3], 'wb') as serializedIndex:
    cPickle.dump(index, serializedIndex, cPickle.HIGHEST_PROTOCOL)
Right before serialization, my program is using about 50% of memory (1.6 GB). As soon as I make the call to cPickle, my memory usage skyrockets to 80% before crashing.
Why is cPickle using so much memory for serialization? Is there a better way to be approaching this problem?
Recommended answer
cPickle needs a lot of extra memory because it does cycle detection: while pickling, it keeps a memo of every object it has already seen, and that memo grows with the number of objects in the structure. If you are sure your data has no cycles, you could try the marshal module instead.
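A minimal sketch of the marshal approach, assuming the index contains only built-in types (dicts, lists, strings, integers), which is all marshal supports. The sample `index` and the temporary file path are placeholders for illustration, not the asker's data.

```python
import marshal
import os
import tempfile

# Small stand-in for the real index: {word: {doc_name: [positions]}}
index = {
    "cat": {"a.txt": [2], "b.txt": [4]},
    "the": {"a.txt": [0], "b.txt": [0, 3]},
}

# Placeholder output path; the question uses sys.argv[3] instead
path = os.path.join(tempfile.mkdtemp(), "index.marshal")

# Write the index out to disk
with open(path, "wb") as f:
    marshal.dump(index, f)

# Read it back
with open(path, "rb") as f:
    restored = marshal.load(f)
```

Note that the marshal format is not guaranteed to be compatible across Python versions, so a file written this way should be read back by the same interpreter version that wrote it.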