Python tips for memory optimization


Question

I need to optimize the RAM usage of my application.
PLEASE spare me the lectures telling me I shouldn't care about memory when coding Python. I have a memory problem because I use very large default-dictionaries (yes, I also want to be fast). My current memory consumption is 350MB and growing. I already cannot use shared hosting and if my Apache opens more processes the memory doubles and triples... and it is expensive.
I have done extensive profiling and I know exactly where my problems are.
I have several large (>100K entries) dictionaries with Unicode keys. A dictionary starts at 140 bytes and grows fast, but the bigger problem is the keys. Python optimizes strings in memory (or so I've read) so that lookups can be ID comparisons ('interning' them). Not sure this is also true for unicode strings (I was not able to 'intern' them).
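For what it's worth, Python 2's built-in intern() indeed rejects unicode objects, which matches the asker's experience; Python 3 exposes sys.intern, which does accept (unicode) str. A quick CPython-specific illustration of what interning buys:

```python
import sys

# Strings built at runtime are generally NOT auto-interned in CPython:
a = ''.join(['he', 'llo'])
b = ''.join(['hel', 'lo'])
assert a == b            # equal contents, but typically two distinct
                         # objects, so a dict lookup must compare contents

a2, b2 = sys.intern(a), sys.intern(b)
same = a2 is b2          # interned copies collapse to one object, so
                         # key comparisons can be identity checks
```

Beyond faster lookups, interning also means duplicate key strings share one copy in memory.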
The objects stored in the dictionary are lists of tuples (an_object, an int, an int).

my_big_dict[some_unicode_string].append((my_object, an_int, another_int))

I already found that it is worthwhile to split into several dictionaries, because the tuples take a lot of space...
I found that I could save RAM by hashing the strings before using them as keys! But then, sadly, I ran into birthday collisions on my 32-bit system. (Side question: is there a 64-bit key dictionary I can use on a 32-bit system?)
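On the side question: ordinary Python dicts already accept arbitrarily large integer keys, even on a 32-bit build, since Python ints are not limited to the machine word size. A hedged sketch of hashing keys down to a chosen width (md5 is an arbitrary choice of stable digest; `key_hash` is a made-up helper, and `int.from_bytes` is Python 3):

```python
import hashlib

def key_hash(s, bits=64):
    """Hash a unicode string down to a fixed-width integer key."""
    digest = hashlib.md5(s.encode('utf-8')).digest()
    return int.from_bytes(digest[:bits // 8], 'big')

# Rough birthday-bound intuition: with n keys and b hash bits,
# collisions become likely once n approaches 2**(b/2). 100K+ keys
# is well past 2**16, so 32-bit hashes collide; 64-bit hashes are
# comfortable until around 2**32 keys.
```

So simply widening the hash to 64 bits sidesteps the birthday collisions at 100K-entry scale.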

Python 2.6.5 on both Linux (production) and Windows. Any tips on optimizing the memory usage of dictionaries / lists / tuples? I even thought of using C - I don't care if this very small piece of code is ugly. It is just a single spot.

Thanks in advance!

Solution

I suggest the following: store all the values in a DB, and keep an in-memory dictionary with string hashes as keys. If a collision occurs, fetch values from the DB, otherwise (vast majority of the cases) use the dictionary. Effectively, it will be a giant cache.
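The scheme above could be sketched like this (all names are made up, and a plain dict stands in for the DB; mirroring the answerer's library, any second write to the same hash slot evicts it and falls back to the store):

```python
class HashedCache(object):
    """Values live in a backing store; an in-memory dict keyed by
    integer hashes serves the vast majority of lookups."""

    def __init__(self, store):
        self.store = store       # dict-like stand-in for the DB
        self.cache = {}          # int hash -> value; no string keys in RAM
        self.colliding = set()   # hashes that were ever written twice

    def put(self, key, value):
        self.store[key] = value          # the DB always holds the truth
        h = hash(key)
        if h in self.colliding:
            return                       # this slot is already distrusted
        if h in self.cache:
            del self.cache[h]            # second write: evict and
            self.colliding.add(h)        # stop trusting this hash
        else:
            self.cache[h] = value

    def get(self, key):
        h = hash(key)
        if h in self.colliding or h not in self.cache:
            return self.store[key]       # rare: collision or cold key
        return self.cache[h]
```

Note the simplification: like the HashIntDict behavior described below, a legitimate overwrite of the same key is indistinguishable from a collision, so it too is routed to the store.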

A problem with dictionaries in Python is that they use a lot of space: even an int-int dictionary uses 45-80 bytes per key-value pair on a 32-bit system. At the same time, an array.array('i') uses only 8 bytes per pair of ints, and with a little bookkeeping one can implement a reasonably fast array-based int → int dictionary.
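The gap is easy to measure (exact per-pair numbers vary by platform and Python version, so the figures in the comments are only indicative):

```python
import array
import sys

pairs = 100000

# Flat array of interleaved (key, value) C ints: ~8 bytes per pair
# plus a small constant/overallocation overhead.
arr = array.array('i')
for i in range(pairs):
    arr.append(i)        # key
    arr.append(i * 2)    # value
per_pair_array = sys.getsizeof(arr) / float(pairs)

# A real dict sized for the same pairs (container overhead only;
# the int objects themselves are not even counted here).
d = {i: i * 2 for i in range(pairs)}
per_pair_dict = sys.getsizeof(d) / float(pairs)
```

On typical builds the dict costs several times more per pair than the array, before counting the boxed int objects a dict also keeps alive.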

Once you have a memory-efficient implementation of an int-int dictionary, split your string → (object, int, int) dictionary into three dictionaries and use hashes instead of full strings. You'll get one int → object and two int → int dictionaries. Emulate the int → object dictionary as follows: keep a list of objects and store indexes of the objects as values of an int → int dictionary.
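The split might look like the following sketch, with plain dicts standing in for the memory-efficient array-based int → int dictionaries (and `put`/`get` being made-up helper names):

```python
objects = []        # index -> object; the list emulates int -> object
h2index = {}        # key hash -> index into `objects`
h2int_a = {}        # key hash -> first int
h2int_b = {}        # key hash -> second int

def put(key, obj, a, b):
    h = hash(key)               # an int stands in for the full string
    objects.append(obj)
    h2index[h] = len(objects) - 1
    h2int_a[h] = a
    h2int_b[h] = b

def get(key):
    h = hash(key)
    return objects[h2index[h]], h2int_a[h], h2int_b[h]
```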

I do realize there's a considerable amount of coding involved to get an array-based dictionary. I had a problem similar to yours and implemented a reasonably fast, very memory-efficient, generic hash-int dictionary. Here's my code (BSD license). It is array-based (8 bytes per pair), it takes care of key hashing and collision checking, it keeps the array (several smaller arrays, actually) ordered during writes, and it does binary search on reads. Your code is reduced to something like:

dictionary = HashIntDict(checking = HashIntDict.CHK_SHOUTING)
# ...
database.store(k, v)
try:
    dictionary[k] = v
except CollisionError:
    pass
# ...
try:
    v = dictionary[k]
except CollisionError:
    v = database.fetch(k)

The checking parameter specifies what happens when a collision occurs: CHK_SHOUTING raises CollisionError on reads and writes, CHK_DELETING returns None on reads and remains silent on writes, CHK_IGNORING doesn't do collision checking.

What follows is a brief description of my implementation; optimization hints are welcome! The top-level data structure is a regular dictionary of arrays. Each array contains up to 2^16 = 65536 integer pairs (the square root of 2^32). A key k and its corresponding value v are both stored in the (k // 65536)-th array. The arrays are initialized on demand and kept ordered by key. A binary search is executed on each read and write. Collision checking is optional. If enabled, an attempt to overwrite an already existing key will remove the key and associated value from the dictionary, add the key to a set of colliding keys, and (again, optionally) raise an exception.
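The layout just described can be sketched in a few lines (this is not the answerer's HashIntDict: plain lists replace array.array for clarity, and overwrites simply replace the value rather than performing the collision bookkeeping):

```python
import bisect

class TinyIntDict(object):
    """Toy int -> int dict: a regular dict maps k // 65536 to a pair of
    parallel arrays (sorted keys, values); binary search within the
    bucket serves both reads and writes."""
    BUCKET = 65536

    def __init__(self):
        self.buckets = {}   # bucket index -> (sorted key list, value list)

    def __setitem__(self, k, v):
        keys, vals = self.buckets.setdefault(k // self.BUCKET, ([], []))
        i = bisect.bisect_left(keys, k)
        if i < len(keys) and keys[i] == k:
            vals[i] = v             # existing key: overwrite in place
        else:
            keys.insert(i, k)       # keep the bucket ordered by key
            vals.insert(i, v)

    def __getitem__(self, k):
        keys, vals = self.buckets[k // self.BUCKET]
        i = bisect.bisect_left(keys, k)
        if i == len(keys) or keys[i] != k:
            raise KeyError(k)
        return vals[i]
```

Swapping the lists for two array.array('i') instances per bucket is what brings the footprint down to 8 bytes per pair.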
