How to store a big dictionary?


Question

I have a big dictionary(28 MB) 'MyDict' stored in a MyDict.py file.

If I execute the statement:

from MyDict import MyDict

A MemoryError exception is thrown.

How can I access this dictionary using the cPickle or shelve modules?

How can I write this MyDict.py file to cPickle or shelve without accessing MyDict?

This MyDict is generated by writing into a file. Here is a key-value pair from the dictionary:

{"""ABCD""" : [[(u'2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85), (u'2012-03-03', 86.0, 87.95, 85.55, 86.2, 30587.0, 86.2), (u'2011-03-23', 36.9, 36.9, 35.25, 36.1, 456.0, 36.1)],
    [(u'2011-03-18', 37.0, 38.0, 36.5, 36.5, 861.0, 36.5), (u'2012-03-03', 86.0, 87.95, 85.55, 86.2, 30587.0, 86.2), (u'2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85)],
    [(u'2011-03-16', 37.0, 37.9, 36.3, 36.7, 3876.0, 36.7), (u'2012-03-03', 86.0, 87.95, 85.55, 86.2, 30587.0, 86.2), (u'2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85)],
    [(u'2010-12-09', 40.5, 41.95, 36.3, 36.75, 42943.0, 36.75), (u'2011-10-26', 67.95, 71.9, 66.45, 70.35, 180812.0, 70.35), (u'2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85)],
    [(u'2009-01-16', 14.75, 15.0, 14.0, 14.15, 14999.0, 14.05), (u'2010-01-11', 50.0, 52.8, 49.0, 50.95, 174826.0, 50.95), (u'2009-01-27', 14.3, 15.0, 13.9, 14.15, 3862.0, 14.15)]]}

Solution

shelve is actually a pretty good choice here. It acts just like a dictionary, but it's backed by a BDB (or similar) key-value database file, and Python will handle all the caching, etc., so it doesn't need to load the whole thing into memory at once.

Here's how to create the shelve file. Note that shelf keys have to be strings. Also note that I'm creating the shelf in-place, rather than first creating a dict and shelving it. That way you avoid the cost of having to build that giant in-memory dict that was causing problems in the first place.

from contextlib import closing
import shelve

def makedict(shelf):
    # Put the real dict-generating code here, obviously
    for i in range(500000):
        shelf[str(i)] = i

with closing(shelve.open('mydict.shelf', 'c')) as shelf:
    makedict(shelf)

And to use it, don't actually read it in; leave it as an on-disk shelf:

from contextlib import closing
import shelve

with closing(shelve.open('mydict.shelf')) as d:
    # Put all your actual work here.
    print(len(d))

If your dictionary-using code doesn't fit easily into a scope, replace the with statement with a plain open, and explicitly close it when you're done.
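The non-`with` form might look like the following. This is a minimal sketch written for Python 3, where `shelve.open` takes the same arguments; the temporary path and the sample value are made up for illustration.

```python
import os
import shelve
import tempfile

# Hypothetical location for the shelf; any writable path works.
shelf_path = os.path.join(tempfile.mkdtemp(), 'mydict.shelf')

# Open once, early in the program, and keep the handle around.
d = shelve.open(shelf_path, 'c')
try:
    d['ABCD'] = [1, 2, 3]      # use it like a dict wherever it is needed
    entry_count = len(d)
finally:
    d.close()                  # explicit close writes pending data to disk
```

The `try`/`finally` is there so the shelf is still closed if the work in between raises.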

pickle is probably not as good an idea, because you still have to read the whole thing into memory. It will probably use a lot less transient memory, and maybe disk space, than importing a module that defines a giant literal, but an in-memory hash table that huge could still be a problem. You can always test it and see how well it works.

Here's how to create the pickle file. Note that you can use (nearly) anything you want as a key, not just strings. However, you have to build the whole dict before you can pickle it.

import cPickle

def makedict():
    # Put the real dict-generating code here, obviously
    return {i:i for i in range(500000)}

d = makedict()
with open('mydict.pickle', 'wb') as f:
    cPickle.dump(d, f, -1)  # -1 selects the highest available pickle protocol

This creates a 47MB file.

Now, to use it in your main app:

import cPickle

def loaddict():
    with open('mydict.pickle', 'rb') as f:
        return cPickle.load(f)

The same basic problems with pickle go for any other persistence format that has to be saved and loaded—whether something custom that you write yourself, or something standard like JSON or YAML. (Of course if you need interoperability with other programs, especially in other languages, something like JSON is the way to go.) You're better off with a database; the only question is, what kind of database.

The advantage of an anydbm type database is that you can use it as if it were a dict, without worrying about how to load/save/access it (other than the open and close lines). The problem with anydbm is that it only lets you map strings to strings.
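To make the strings-only restriction concrete, here is a small sketch for Python 3, where `anydbm` became the `dbm` package. It uses the pure-Python `dbm.dumb` backend only because that one is always available; the path and the sample values are illustrative.

```python
import dbm.dumb
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'strings.db')

rejected = False
db = dbm.dumb.open(path, 'c')
try:
    db['ABCD'] = '35.5'        # str keys and values are encoded to bytes
    raw = db['ABCD']           # ...and values always come back as bytes
    try:
        db['XYZ'] = [35.5, 37.5]   # a list is not a string
    except TypeError:
        rejected = True
finally:
    db.close()
```

Anything richer than a string has to be serialized by hand before it goes in, which is the gap shelve fills.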

The shelve module effectively wraps anydbm, with pickling of each value. Your keys still have to be strings, but your values can be almost anything. So as long as your keys are strings, and you don't have any references from the values to external objects, it's a pretty transparent drop-in replacement for a dict.
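As a rough illustration in Python 3, a record shaped like the one in the question can be stored under a string key and read back unchanged. The path is a throwaway temporary file and the record is cut down to two rows.

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'quotes.shelf')

# A trimmed record shaped like the question's data: a list of lists of tuples.
record = [[('2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85)],
          [('2011-03-18', 37.0, 38.0, 36.5, 36.5, 861.0, 36.5)]]

with shelve.open(path, 'c') as shelf:
    shelf['ABCD'] = record     # the value is pickled on the way in
    loaded = shelf['ABCD']     # ...and unpickled on the way out
```

Only the entry being read is unpickled, so memory use follows the size of one value rather than the size of the whole dictionary.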

The other options—sqlite3, various modern nosql databases, etc.—require you to change the way you access data, and even the way you organize it. (A "list of lists" isn't a clear ER model.) Of course in the long run, this might result in a better design, so if you think you really should be using a relational model, consider this idea.
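For comparison, a relational layout for the same data could be sketched like this with sqlite3. The table name `quotes`, the `grp` column for the inner-list index, and the generic `v1`..`v6` column names are all assumptions, since the question does not say what the tuple fields mean.

```python
import sqlite3

record = {'ABCD': [[('2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85),
                    ('2012-03-03', 86.0, 87.95, 85.55, 86.2, 30587.0, 86.2)],
                   [('2011-03-18', 37.0, 38.0, 36.5, 36.5, 861.0, 36.5)]]}

conn = sqlite3.connect(':memory:')   # use a file path for real data
conn.execute('''CREATE TABLE quotes (
    symbol TEXT, grp INTEGER, day TEXT,
    v1 REAL, v2 REAL, v3 REAL, v4 REAL, v5 REAL, v6 REAL)''')

# Flatten dict -> list of lists -> tuples into one row per tuple.
rows = [(symbol, grp) + row
        for symbol, groups in record.items()
        for grp, group in enumerate(groups)
        for row in group]
conn.executemany('INSERT INTO quotes VALUES (?,?,?,?,?,?,?,?,?)', rows)
conn.commit()

# Access is now a query rather than a dict lookup.
abcd_2011 = conn.execute(
    "SELECT day, v1 FROM quotes "
    "WHERE symbol = ? AND day LIKE '2011-%' ORDER BY day",
    ('ABCD',)).fetchall()
conn.close()
```

The flattening step is where the design work shows up: the nested lists have to become explicit columns before anything can be stored.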


From the comments, @ekta wanted me to explain why some of the restrictions on dbm and shelve exist.

First, dbm goes back to the 70s. A database that could map 8-bit strings to strings simply and efficiently was a pretty huge deal back then. It was also pretty common for values of all kinds to be stored as their string representation—or, if not that, then just storing the bytes that happen to represent the value natively on the current machine. (XML, JSON, or even endianness-swapping may have been too expensive for the machines of the day, or at least the thinking of the day.)

Extending dbm to handle other data types for the values isn't hard. They never need to be hashed or compared, just stored and retrieved losslessly. Since pickle can handle a very wide variety of types, isn't too horribly inefficient, and comes with Python, it makes sense to use pickle for that, so shelve does exactly that.

But the keys are a different story. You need an encoding that's not only losslessly reversible, but also ensures that two values will encode to equal bytes if and only if they're actually equal. Keep in mind that in Python, 1 == True, but obviously pickle.dumps(1) != pickle.dumps(True), b'1' != b'True', etc.
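The mismatch is easy to check directly. A short sketch in Python 3, with variable names chosen only for the demonstration:

```python
import pickle

d = {1: 'one'}
same_dict_slot = d[True]       # works: 1 == True and hash(1) == hash(True)

# Equal keys, unequal serialized forms.
pickles_differ = pickle.dumps(1) != pickle.dumps(True)
texts_differ = str(1).encode() != str(True).encode()   # b'1' vs b'True'

# The same thing happens across int and float.
floats_too = (1 == 1.0) and pickle.dumps(1) != pickle.dumps(1.0)
```

A dbm compares keys byte for byte, so pickled keys would make `1` and `True` two separate entries where a dict has one.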

There are plenty of types that can be losslessly and equality-preservingly converted to bytes if you only care about that type. For example, for Unicode strings, just use UTF-8. (Actually, shelve takes care of that one for you.) For 32-bit signed integers, use struct.pack('>i', n). For tuples of three strings, encode to UTF-8, backslash-escape, and join them with newlines. And so on. For many specific domains, there's an easy answer; there's just no general-purpose answer that works for most domains.
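Two of these single-type encodings could be sketched as follows in Python 3. The helper names are invented for the example, and the integer helper uses `'>i'`, the struct format code for a big-endian signed 32-bit value.

```python
import struct

def encode_int_key(n):
    """Big-endian signed 32-bit: always 4 bytes, one encoding per value."""
    return struct.pack('>i', n)

def decode_int_key(raw):
    return struct.unpack('>i', raw)[0]

def encode_text_key(s):
    """UTF-8 bytes for a Unicode string."""
    return s.encode('utf-8')

int_key = encode_int_key(-2)
round_trip = decode_int_key(int_key)

# Equal values now produce equal bytes, unlike with pickle.
bool_key_matches = encode_int_key(True) == encode_int_key(1)
text_key = encode_text_key('ABCD')
```

Each helper is only correct for its own type; mixing integer and text keys in one database would need a tag or some other way to keep the two encodings apart.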

So, if you want a dbm that uses, say, tuples of three strings as keys, you can write your own wrapper around dbm (or shelve). As with many modules in the stdlib, shelve is intended to be helpful sample code as well as a usable feature, which is why the docs have a link to the source. It's simple enough that a novice should be able to figure out how to fork it, subclass it, or wrap it to do their own custom key encoding. (Note that if you wrap shelve, you will have to encode your custom keys to str just so it can encode that str to bytes; if you fork it, or subclass it and override the relevant methods, you can instead encode directly to bytes, e.g., with that struct.pack call above. This may be better for both simplicity/readability and performance.)
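A wrapper of the "wrap shelve" kind might look roughly like this in Python 3. The class name `TupleKeyShelf` and the helper functions are invented for the sketch; it handles only keys that are tuples of exactly three strings, escaping backslashes and newlines before joining with newlines.

```python
import os
import shelve
import tempfile

def _escape(s):
    # Escape backslashes first so escaped newlines stay unambiguous.
    return s.replace('\\', '\\\\').replace('\n', '\\n')

def _unescape(s):
    out, chars = [], iter(s)
    for ch in chars:
        if ch == '\\':
            nxt = next(chars)              # the escaped character
            out.append('\n' if nxt == 'n' else nxt)
        else:
            out.append(ch)
    return ''.join(out)

def encode_key(parts):
    a, b, c = parts                        # exactly three strings
    return '\n'.join(_escape(p) for p in (a, b, c))

def decode_key(text):
    return tuple(_unescape(p) for p in text.split('\n'))

class TupleKeyShelf:
    """Thin wrapper: tuple keys become str, shelve turns str into bytes."""
    def __init__(self, path, flag='c'):
        self._shelf = shelve.open(path, flag)
    def __setitem__(self, key, value):
        self._shelf[encode_key(key)] = value
    def __getitem__(self, key):
        return self._shelf[encode_key(key)]
    def __contains__(self, key):
        return encode_key(key) in self._shelf
    def keys(self):
        return [decode_key(k) for k in self._shelf.keys()]
    def close(self):
        self._shelf.close()

path = os.path.join(tempfile.mkdtemp(), 'tuples.shelf')
store = TupleKeyShelf(path)
tricky = ('ABCD', 'line\nbreak', 'back\\slash')
store[tricky] = [35.5, 37.5]
value = store[tricky]
keys = store.keys()
store.close()
```

Subclassing `shelve.Shelf` and overriding its key handling would skip the intermediate str step, at the cost of depending on the module's internals.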
