Python: Reducing memory usage of dictionary


Problem Description


I'm trying to load a couple of files into memory. Each file has one of the following three formats:

  • string TAB int
  • string TAB float
  • int TAB float.

They are n-gram statistics files, in case that helps with the solution. For instance:

i_love TAB 10
love_you TAB 12

Currently, the pseudocode of what I'm doing is:

def loadData(path):
    data = {}
    with open(path) as f:                 # the file is TAB-separated
        for line in f:
            first, second = line.rstrip('\n').split('\t')
            data[first] = int(second)     # or float(second), depending on the file
    return data
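Since the three formats differ only in the value type, one small variant (my own sketch; the function name and file names below are made up for illustration) is to pass the conversion function per file:

def load_data_typed(path, convert=int):
    data = {}
    with open(path) as f:
        for line in f:
            key, value = line.rstrip('\n').split('\t')
            data[key] = convert(value)    # int or float, chosen per file
    return data

# Hypothetical file names, purely for illustration.
unigram_counts = load_data_typed('unigram_counts.txt', int)    # string TAB int
bigram_scores  = load_data_typed('bigram_scores.txt', float)   # string TAB float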

Much to my surprise, while the total size of the files on disk is about 21 MB, when loaded into memory the process takes 120-180 MB of memory! (The whole Python application doesn't load any other data into memory.)

There are fewer than 10 files; most of them stay stable at about 50-80k lines, except for one file which currently has millions of lines.

So I would like to ask for a technique/data structure to reduce the memory consumption:

  • Any advice for compression techniques?
  • If I still use a dict, is there any way to reduce the memory? Is it possible to set the "load factor" for a Python dict, as in Java?
  • If you can suggest some other data structure, I'm also willing to trade some speed to reduce the memory. Nevertheless, this is a time-sensitive application, so once users input their queries, I don't think it would be reasonable to take more than a few seconds to return the result. With regard to this, I'm still amazed by how Google manages to make Google Translate so fast: they must be using a lot of techniques plus a lot of server power.

Thank you very much. I look forward to your advice.

Solution

I cannot offer a complete strategy that would help improve memory footprint, but I believe it may help to analyse what exactly is taking so much memory.

If you look at the Python implementation of dictionary (which is a relatively straight-forward implementation of a hash table), as well as the implementation of the built-in string and integer data types, for example here (specifically object.h, intobject.h, stringobject.h and dictobject.h, as well as the corresponding *.c files in ../Objects), you can calculate with some accuracy the expected space requirements:

  1. An integer is a fixed-sized object, i.e. it contains a reference count, a type pointer and the actual integer, in total typically at least 12 bytes on a 32bit system and 24 bytes on a 64bit system, not taking into account extra space possibly lost through alignment.

  2. A string object is variable-sized, which means it contains

    • reference count
    • type pointer
    • size information
    • space for the lazily calculated hash code
    • state information (e.g. used for interned strings)
    • a pointer to the dynamic content

    in total at least 24 bytes on 32bit or 60 bytes on 64bit, not including space for the string itself.

  3. The dictionary itself consists of a number of buckets, each containing

    • the hash code of the object currently stored (that is not predictable from the position of the bucket due to the collision resolution strategy used)
    • a pointer to the key object
    • a pointer to the value object

    in total at least 12 bytes on 32bit and 24 bytes on 64bit.

  4. The dictionary starts out with 8 empty buckets and is resized by doubling the number of buckets whenever its capacity is reached.
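The per-object figures above can be spot-checked directly in the interpreter; a minimal sketch, assuming a Python version that provides sys.getsizeof (2.6 or later); the exact numbers depend on the build and pointer width:

import sys

# Sizes as reported by this interpreter; exact values vary with the Python
# version, the build and the pointer width (32-bit vs. 64-bit).
print(sys.getsizeof(10))          # an int object
print(sys.getsizeof('i_love'))    # string header plus the characters
print(sys.getsizeof({}))          # an empty dict, which starts with 8 buckets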

I carried out a test with a list of 46,461 unique strings (337,670 bytes concatenated string size), each associated with an integer — similar to your setup, on a 32-bit machine. According to the calculation above, I would expect a minimum memory footprint of

  • 46,461 * (24+12) bytes = 1.6 MB for the string/integer combinations
  • 337,670 bytes = 0.3 MB for the string contents
  • 65,536 * 12 bytes = 0.8 MB for the hash buckets (after resizing 13 times)

in total 2.65 MB. (For a 64-bit system the corresponding calculation yields 5.5 MB.)
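Written out as code, the same back-of-the-envelope estimate (using the 32-bit figures quoted above) is:

# 32-bit estimate built from the figures quoted above.
n_keys  = 46461                  # unique strings
n_chars = 337670                 # concatenated string size in bytes
buckets = 65536                  # bucket count after 13 doublings from 8

objects = n_keys * (24 + 12)     # string headers plus int objects
table   = buckets * 12           # one 12-byte slot per bucket
total   = objects + n_chars + table
print(total / (1024.0 * 1024))   # about 2.7 MB, i.e. the ~2.65 MB quoted above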

With the Python interpreter running idle, its footprint according to the ps tool is 4.6 MB. So the total expected memory consumption after creating the dictionary is approximately 4.6 + 2.65 = 7.25 MB. The true memory footprint (according to ps) in my test was 7.6 MB. I guess the extra ca. 0.35 MB was consumed by overhead generated through Python's memory allocation strategy (memory arenas etc.).
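For what it's worth, roughly the same number can be read from inside the process on a Unix-like system (an assumption on my part); note that ru_maxrss is reported in kilobytes on Linux but in bytes on Mac OS X:

import resource

# Peak resident set size of the current process, as accounted by the OS.
# Units differ by platform: kilobytes on Linux, bytes on Mac OS X.
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)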

Of course many people will now point out that my use of ps to measure the memory footprint is inaccurate and my assumptions about the size of pointer types and integers on 32-bit and 64-bit systems may be wrong on many specific systems. Granted.

But, nevertheless, the key conclusions, I believe, are these:

  • The Python dictionary implementation consumes a surprisingly small amount of memory
  • But the space taken by the many int and (in particular) string objects, for reference counts, pre-calculated hash codes etc., is more than you'd think at first
  • There is hardly a way to avoid the memory overhead, as long as you use Python and want the strings and integers represented as individual objects — at least I don't see how that could be done
  • It may be worthwhile to look for (or implement yourself) a Python-C extension that implements a hash that stores keys and values as C-pointers (rather than Python objects). I don't know if that exists; but I believe it could be done and could reduce the memory footprint by more than half.
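Short of writing such an extension, one pure-Python compromise in the same spirit (my own hedged sketch, not part of the answer above) is to stop storing a separate Python int or float object per value: keep the keys in one sorted list and the numbers in an array.array, and accept an O(log n) bisect lookup instead of the dict's O(1):

from array import array
from bisect import bisect_left

def build_index(pairs):
    # pairs: iterable of (key, number); sort once so lookups can use bisect.
    pairs = sorted(pairs)
    keys = [k for k, _ in pairs]
    values = array('l', (v for _, v in pairs))   # use 'd' for the float-valued files
    return keys, values

def lookup(keys, values, key):
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    raise KeyError(key)

The string keys themselves are still individual Python objects, so this mainly saves the per-entry value objects and the hash-table slots; the larger savings hinted at above would require packing the keys into a contiguous buffer as well, which is essentially what a C-level hash extension would do.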
