Memory Efficient Alternatives to Python Dictionaries


Problem description

In one of my current side projects, I am scanning through some text looking at the frequency of word triplets. In my first go at it, I used the default dictionary three levels deep. In other words, topDict[word1][word2][word3] returns the number of times these words appear in the text, topDict[word1][word2] returns a dictionary with all the words that appeared following words 1 and 2, etc.
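
For concreteness, a minimal sketch of the structure described above (names and tokenization are illustrative, not the asker's actual code):

from collections import defaultdict

def count_trigrams(text):
    topDict = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    words = text.split()  # naive whitespace tokenization, assumed for illustration
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        topDict[w1][w2][w3] += 1
    return topDict

# topDict[w1][w2] is a dict of all third words seen after (w1, w2);
# topDict[w1][w2][w3] is the count for that triplet.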

This functions correctly, but it is very memory intensive. In my initial tests it used something like 20 times the memory of just storing the triplets in a text file, which seems like an overly large amount of memory overhead.

My suspicion is that many of these dictionaries are being created with many more slots than are actually being used, so I want to replace the dictionaries with something else that is more memory efficient when used in this manner. I would strongly prefer a solution that allows key lookups along the lines of the dictionaries.

From what I know of data structures, a balanced binary search tree using something like red-black or AVL would probably be ideal, but I would really prefer not to implement them myself. If possible, I'd prefer to stick with standard python libraries, but I'm definitely open to other alternatives if they would work best.

So, does anyone have any suggestions for me?

Edited to add:

Thanks for the responses so far. A few of the answers so far have suggested using tuples, which didn't really do much for me when I condensed the first two words into a tuple. I am hesitant to use all three as a key since I want it to be easy to look up all third words given the first two. (i.e. I want something like the result of topDict[word1, word2].keys()).
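
One way to keep that lookup pattern while flattening one level is to key on the (word1, word2) pair and hold a small dict of third-word counts as the value; this is a hypothetical sketch of the desired interface, not the asker's code:

from collections import defaultdict

topDict = defaultdict(dict)

def add_trigram(w1, w2, w3):
    inner = topDict[w1, w2]          # key is the tuple (w1, w2)
    inner[w3] = inner.get(w3, 0) + 1

# topDict[word1, word2].keys() then lists every third word seen after the pair.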

The current dataset I am playing around with is the most recent version of Wikipedia For Schools. The results of parsing the first thousand pages, for example, is something like 11MB for a text file where each line is the three words and the count all tab separated. Storing the text in the dictionary format I am now using takes around 185MB. I know that there will be some additional overhead for pointers and whatnot, but the difference seems excessive.

Recommended answer

Some measurements. I took 10MB of free e-book text and computed trigram frequencies, producing a 24MB file. Storing it in different simple Python data structures took this much space in kB, measured as RSS from running ps, where d is a dict, keys and freqs are lists, and a,b,c,freq are the fields of a trigram record:

295760     S. Lott's answer
237984     S. Lott's with keys interned before passing in
203172 [*] d[(a,b,c)] = int(freq)
203156     d[a][b][c] = int(freq)
189132     keys.append((a,b,c)); freqs.append(int(freq))
146132     d[intern(a),intern(b)][intern(c)] = int(freq)
145408     d[intern(a)][intern(b)][intern(c)] = int(freq)
 83888 [*] d[a+' '+b+' '+c] = int(freq)
 82776 [*] d[(intern(a),intern(b),intern(c))] = int(freq)
 68756     keys.append((intern(a),intern(b),intern(c))); freqs.append(int(freq))
 60320     keys.append(a+' '+b+' '+c); freqs.append(int(freq))
 50556     pair array
 48320     squeezed pair array
 33024     squeezed single array

The entries marked [*] have no efficient way to look up a pair (a,b); they're listed only because others have suggested them (or variants of them). (I was sort of irked into making this because the top-voted answers were not helpful, as the table shows.)
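
For reference, the interned nested-dict rows above correspond to loading code along these lines (a sketch, not the measured script; intern() was a builtin in Python 2 and is sys.intern() in Python 3). The input format is assumed to be one 'a b c freq' record per line:

import sys
from collections import defaultdict

def load_interned(file):
    d = defaultdict(lambda: defaultdict(dict))
    for line in file:
        a, b, c, freq = line.split()
        # interning makes repeated words share a single string object
        d[sys.intern(a)][sys.intern(b)][sys.intern(c)] = int(freq)
    return d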

'Pair array' is the scheme below in my original answer ("I'd start with the array with keys being the first two words..."), where the value table for each pair is represented as a single string. 'Squeezed pair array' is the same, leaving out the frequency values that are equal to 1 (the most common case). 'Squeezed single array' is like squeezed pair array, but gloms key and value together as one string (with a separator character). The squeezed single array code:

import collections

def build(file):
    pairs = collections.defaultdict(list)
    for line in file:  # N.B. file assumed to be already sorted
        a, b, c, freq = line.split()
        key = ' '.join((a, b))
        pairs[key].append(c + ':' + freq if freq != '1' else c)
    out = open('squeezedsinglearrayfile', 'w')
    for key in sorted(pairs.keys()):
        out.write('%s|%s\n' % (key, ' '.join(pairs[key])))

def load():
    return open('squeezedsinglearrayfile').readlines()

if __name__ == '__main__':
    build(open('freqs'))

I haven't written the code to look up values from this structure (use bisect, as mentioned below), or implemented the fancier compressed structures also described below.
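
A lookup for the squeezed single array might look roughly like this (my sketch, not the answer's code), using bisect on the sorted list of lines returned by load():

import bisect

def lookup(lines, a, b):
    prefix = a + ' ' + b + '|'
    i = bisect.bisect_left(lines, prefix)
    if i == len(lines) or not lines[i].startswith(prefix):
        return {}                                 # pair never seen
    counts = {}
    for entry in lines[i].split('|', 1)[1].split():
        word, _, freq = entry.partition(':')
        counts[word] = int(freq) if freq else 1   # a bare word means count 1
    return counts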

Original answer: A simple sorted array of strings, each string being a space-separated concatenation of words, searched using the bisect module, should be worth trying for a start. This saves space on pointers, etc. It still wastes space due to the repetition of words; there's a standard trick to strip out common prefixes, with another level of index to get them back, but that's rather more complex and slower. (The idea is to store successive chunks of the array in a compressed form that must be scanned sequentially, along with a random-access index to each chunk. Chunks are big enough to compress, but small enough for reasonable access time. The particular compression scheme applicable here: if successive entries are 'hello george' and 'hello world', make the second entry be '6world' instead. (6 being the length of the prefix in common.) Or maybe you could get away with using zlib? Anyway, you can find out more in this vein by looking up dictionary structures used in full-text search.) So specifically, I'd start with the array with keys being the first two words, with a parallel array whose entries list the possible third words and their frequencies. It might still suck, though -- I think you may be out of luck as far as batteries-included memory-efficient options.
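
A rough illustration of that common-prefix trick (sometimes called front coding); this is my sketch of the idea, and a real decoder would need a separator if entries can begin with digits:

def front_code(sorted_entries):
    out = []
    prev = ''
    for entry in sorted_entries:
        n = 0                            # length of prefix shared with previous entry
        while n < min(len(prev), len(entry)) and prev[n] == entry[n]:
            n += 1
        out.append('%d%s' % (n, entry[n:]) if prev else entry)
        prev = entry
    return out

# front_code(['hello george', 'hello world']) == ['hello george', '6world']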

Also, binary tree structures are not recommended for memory efficiency here. E.g., this paper tests a variety of data structures on a similar problem (unigrams instead of trigrams though) and finds a hashtable to beat all of the tree structures by that measure.

I should have mentioned, as someone else did, that the sorted array could be used just for the wordlist, not bigrams or trigrams; then for your 'real' data structure, whatever it is, you use integer keys instead of strings -- indices into the wordlist. (But this keeps you from exploiting common prefixes except in the wordlist itself. Maybe I shouldn't suggest this after all.)
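
As a sketch of that wordlist idea (a hypothetical helper, not from the answer): assign each distinct word a small integer once, then store trigrams as integer triples:

def make_wordlist(words):
    word_to_id = {}
    wordlist = []
    for w in words:
        if w not in word_to_id:
            word_to_id[w] = len(wordlist)
            wordlist.append(w)
    return wordlist, word_to_id

# A trigram (a, b, c) is stored as (word_to_id[a], word_to_id[b], word_to_id[c]);
# wordlist[i] recovers the original word for display.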
