Memory Efficient Alternatives to Python Dictionaries

Problem Description

In one of my current side projects, I am scanning through some text looking at the frequency of word triplets. In my first go at it, I used a defaultdict three levels deep. In other words, topDictionary[word1][word2][word3] returns the number of times these words appear in the text, topDictionary[word1][word2] returns a dictionary with all the words that appeared following words 1 and 2, and so on.
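
For concreteness, a minimal sketch of the three-levels-deep defaultdict counting described above (the variable names and sample text are illustrative, not taken from the actual project code):

import collections

# Nested structure: topDictionary[word1][word2][word3] -> count of that triplet
topDictionary = collections.defaultdict(
    lambda: collections.defaultdict(lambda: collections.defaultdict(int)))

def count_triplets(words):
    # Slide a window of three consecutive words across the text
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        topDictionary[w1][w2][w3] += 1

count_triplets("the quick brown fox jumps over the lazy dog".split())
print(topDictionary["the"]["quick"]["brown"])        # 1
print(list(topDictionary["quick"]["brown"].keys()))  # ['fox']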

This functions correctly, but it is very memory intensive. In my initial tests it used something like 20 times the memory of just storing the triplets in a text file, which seems like an overly large amount of memory overhead.

My suspicion is that many of these dictionaries are being created with many more slots than are actually being used, so I want to replace the dictionaries with something else that is more memory efficient when used in this manner. I would strongly prefer a solution that allows key lookups along the lines of the dictionaries.

From what I know of data structures, a balanced binary search tree using something like red-black or AVL would probably be ideal, but I would really prefer not to implement one myself. If possible, I'd prefer to stick with standard Python libraries, but I'm definitely open to other alternatives if they would work best.

So, does anyone have any suggestions for me?

Edited to add:

Thanks for the responses so far. A few of the answers have suggested using tuples, which didn't really do much for me when I condensed the first two words into a tuple. I am hesitant to use all three as a key, since I want it to be easy to look up all third words given the first two (i.e. I want something like the result of topDict[word1, word2].keys()).
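
One way to keep that lookup convenient is to key the outer dictionary on the (word1, word2) pair and keep a plain counter of third words inside. A minimal sketch with illustrative names (note that, as stated above, the tuple form gave little memory benefit in practice):

import collections

topDict = collections.defaultdict(collections.Counter)

def count_triplets(words):
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        topDict[w1, w2][w3] += 1   # outer key is the (w1, w2) tuple

count_triplets("the quick brown fox jumps over the quick brown dog".split())
print(list(topDict["the", "quick"].keys()))   # ['brown']
print(topDict["quick", "brown"])              # Counter({'fox': 1, 'dog': 1})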

The current dataset I am playing around with is the most recent version of Wikipedia For Schools. The results of parsing the first thousand pages, for example, come to something like 11MB for a text file where each line is the three words and the count, all tab-separated. Storing the text in the dictionary format I am now using takes around 185MB. I know that there will be some additional overhead for pointers and whatnot, but the difference seems excessive.

Once again, thank you all for the responses so far.

Recommended Answer

Some measurements. I took 10MB of free e-book text and computed trigram frequencies, producing a 24MB file. Storing it in different simple Python data structures took this much space in kB, measured as RSS from running ps, where d is a dict, keys and freqs are lists, and a,b,c,freq are the fields of a trigram record:

295760     S. Lott's answer
237984     S. Lott's with keys interned before passing in
203172 [*] d[(a,b,c)] = int(freq)
203156     d[a][b][c] = int(freq)
189132     keys.append((a,b,c)); freqs.append(int(freq))
146132     d[intern(a),intern(b)][intern(c)] = int(freq)
145408     d[intern(a)][intern(b)][intern(c)] = int(freq)
 83888 [*] d[a+' '+b+' '+c] = int(freq)
 82776 [*] d[(intern(a),intern(b),intern(c))] = int(freq)
 68756     keys.append((intern(a),intern(b),intern(c))); freqs.append(int(freq))
 60320     keys.append(a+' '+b+' '+c); freqs.append(int(freq))
 50556     pair array
 48320     squeezed pair array
 33024     squeezed single array

The entries marked [*] have no efficient way to look up a pair (a,b); they're listed only because others have suggested them (or variants of them). (I was sort of irked into making this because the top-voted answers were not helpful, as the table shows.)
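
To make the interned rows concrete: interning makes every repeated word share one string object, which is where most of the savings in those rows come from. A rough sketch of the d[intern(a),intern(b)][intern(c)] scheme follows; the table's measurements used Python 2's built-in intern(), while the Python 3 equivalent is sys.intern (the file name here is illustrative):

import sys
import collections

d = collections.defaultdict(dict)

def load_trigrams(path):
    # Each input line: "word1 word2 word3 freq", whitespace separated
    with open(path) as f:
        for line in f:
            a, b, c, freq = line.split()
            # Interning collapses duplicate word strings into one shared object
            d[sys.intern(a), sys.intern(b)][sys.intern(c)] = int(freq)

# Example: load_trigrams('freqs'); d['hello', 'world'] then maps third words to counts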

'Pair array' is the scheme below in my original answer ("I'd start with the array with keys being the first two words..."), where the value table for each pair is represented as a single string. 'Squeezed pair array' is the same, leaving out the frequency values that are equal to 1 (the most common case). 'Squeezed single array' is like squeezed pair array, but gloms key and value together as one string (with a separator character). The squeezed single array code:

import collections

def build(file):
    pairs = collections.defaultdict(list)
    for line in file:  # N.B. file assumed to be already sorted
        a, b, c, freq = line.split()
        key = ' '.join((a, b))
        # Squeeze: omit the ':freq' suffix when the count is 1 (the common case)
        pairs[key].append(c + ':' + freq if freq != '1' else c)
    out = open('squeezedsinglearrayfile', 'w')
    for key in sorted(pairs.keys()):
        out.write('%s|%s\n' % (key, ' '.join(pairs[key])))
    out.close()

def load():
    return open('squeezedsinglearrayfile').readlines()

if __name__ == '__main__':
    build(open('freqs'))

I haven't written the code to look up values from this structure (use bisect, as mentioned below), or implemented the fancier compressed structures also described below.
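
A sketch of what that bisect-based lookup might look like, assuming the lines come from load() above in sorted key order; the helper names are illustrative, and the parallel key list trades back a little memory for a correct binary search:

import bisect

def index(lines):
    # Pull the 'word1 word2' key off each line; lines are already in key order
    return [line.split('|', 1)[0] for line in lines]

def lookup(lines, keys, a, b):
    # Return {third_word: count} for the pair (a, b), or {} if it never occurs
    key = a + ' ' + b
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return {}
    counts = {}
    for entry in lines[i].rstrip('\n').split('|', 1)[1].split():
        # A bare word means its count was squeezed out because it was 1
        word, _, freq = entry.partition(':')
        counts[word] = int(freq) if freq else 1
    return counts

# Example: lines = load(); keys = index(lines); lookup(lines, keys, 'hello', 'world')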

Original answer: A simple sorted array of strings, each string being a space-separated concatenation of words, searched using the bisect module, should be worth trying for a start. This saves space on pointers, etc. It still wastes space due to the repetition of words; there's a standard trick to strip out common prefixes, with another level of index to get them back, but that's rather more complex and slower. (The idea is to store successive chunks of the array in a compressed form that must be scanned sequentially, along with a random-access index to each chunk. Chunks are big enough to compress, but small enough for reasonable access time. The particular compression scheme applicable here: if successive entries are 'hello george' and 'hello world', make the second entry be '6world' instead. (6 being the length of the prefix in common.) Or maybe you could get away with using zlib? Anyway, you can find out more in this vein by looking up dictionary structures used in full-text search.) So specifically, I'd start with the array with keys being the first two words, with a parallel array whose entries list the possible third words and their frequencies. It might still suck, though -- I think you may be out of luck as far as batteries-included memory-efficient options.
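
A minimal sketch of that common-prefix trick (often called front coding), purely illustrative and assuming entries do not themselves start with digits:

def front_code(entries):
    # Replace the prefix each entry shares with its predecessor by its length,
    # e.g. ['hello george', 'hello world'] -> ['hello george', '6world']
    coded, prev = [], ''
    for s in entries:
        n = 0
        while n < min(len(s), len(prev)) and s[n] == prev[n]:
            n += 1
        coded.append('%d%s' % (n, s[n:]) if n else s)
        prev = s
    return coded

def front_decode(coded):
    # Reverse front_code; the chunk must be scanned sequentially, as noted above
    out, prev = [], ''
    for s in coded:
        digits = ''
        while s and s[0].isdigit():
            digits, s = digits + s[0], s[1:]
        full = prev[:int(digits)] + s if digits else s
        out.append(full)
        prev = full
    return out

print(front_code(['hello george', 'hello world']))  # ['hello george', '6world']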

Also, binary tree structures are not recommended for memory efficiency here. E.g., this paper tests a variety of data structures on a similar problem (unigrams instead of trigrams though) and finds a hashtable to beat all of the tree structures by that measure.

I should have mentioned, as someone else did, that the sorted array could be used just for the wordlist, not bigrams or trigrams; then for your 'real' data structure, whatever it is, you use integer keys instead of strings -- indices into the wordlist. (But this keeps you from exploiting common prefixes except in the wordlist itself. Maybe I shouldn't suggest this after all.)
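
A rough sketch of that variant, with illustrative names: keep one sorted wordlist, map words to their index with bisect, and store each trigram as a triple of small integers instead of three string references:

import bisect
import collections

def build_wordlist(words):
    # One sorted list of distinct words; list positions serve as integer ids
    return sorted(set(words))

def word_id(wordlist, w):
    i = bisect.bisect_left(wordlist, w)
    if i == len(wordlist) or wordlist[i] != w:
        raise KeyError(w)
    return i

def count_triplets(words):
    wordlist = build_wordlist(words)
    counts = collections.Counter()
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[word_id(wordlist, w1), word_id(wordlist, w2), word_id(wordlist, w3)] += 1
    return wordlist, counts

wordlist, counts = count_triplets("the quick brown fox jumps over the lazy dog".split())
print(counts[tuple(word_id(wordlist, w) for w in ('the', 'quick', 'brown'))])  # 1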
