Many dictionaries using massive amounts of RAM


Problem Description

I have a very simple Python script that creates (for test purposes) 35 million dictionary objects within a list. Each dictionary object contains two key/value pairs, e.g.

{'Name': 'Jordan', 'Age': 35}

The script simply takes a query on name and age, searches through the list of dictionaries, and returns a new list containing the indices of all matching dictionary entries.

However, as you can see below, an insane amount of memory is consumed. I presume I am making a very naive mistake somewhere.

My code is as follows:

import sys

# Firstly, we will create 35 million records in memory, all will be the same apart from one

def search(key, value, data, age):
    print("Searching, please wait")
    # Create list to store returned PKs
    foundPKS = []
    for index in range(0, len(data)):
        if key in data[index] and 'Age' in data[index]:
            if data[index][key] == value and data[index]['Age'] >= age:
                foundPKS.append(index)
    results = foundPKS
    return results

def createdata():
    # Let's create our list for storing our dictionaries
    print("Creating database, please wait")
    dictList = []
    for index in range(0, 35000000):
        # Define dictionary
        record = {'Name': 'Jordan', 'Age': 25}
        if 24500123 <= index <= 24500200:
            record['Name'] = 'Chris'
            record['Age'] = 33
        # Add the dict to a list
        dictList.append(record)
    return dictList

datareturned = createdata()

keyname = input("For which key do you wish to search?")
valuename = input("Which values do you want to find?")
valueage = input("What is the minimum age?")

print("Full data set object size:" + str(sys.getsizeof(datareturned)))
results = search(keyname, valuename, datareturned, int(valueage))

if len(results) > 0:
    print(str(len(results)) + " found. Writing to results.txt")
    fo = open("results.txt", "w")
    for line in range(0, len(results)):
        fo.write(str(results[line]) + "\n")
    fo.close()

What is causing the massive consumption of RAM?

Recommended Answer

The overhead for a dict object is quite large. It depends on your Python version and your system architecture, but on Python 3.5 64-bit:

In [21]: sys.getsizeof({})
Out[21]: 288

So, as a rough estimate:

250*36e6*1e-9 == 9.0

So that is a lower limit on my RAM usage in gigabytes if I created that many dictionaries, not even factoring in the list!
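To put a rough number on the per-record cost yourself, you can sum the size of one sample dict and the sizes of its keys and values (a quick sketch, not from the original answer; the exact figures depend on your Python build, and it ignores that the key strings and the cached age int are actually shared between records, so it over-counts):

import sys

record = {'Name': 'Jordan', 'Age': 25}
# Size of the dict itself plus its keys and values.
per_record = sys.getsizeof(record) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in record.items()
)
print(per_record)                     # a few hundred bytes per record
print(per_record * 35000000 * 1e-9)   # rough total in GB for 35 million records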

Rather than use a dict as a record type, which isn't really its use case, use a namedtuple.

And to get a view of how this compares, let's set up an equivalent list of tuples:

In [22]: from collections import namedtuple

In [23]: Record = namedtuple("Record", "name age")

In [24]: records = [Record("john", 28) for _ in range(36000000)]

In [25]: getsizeof = sys.getsizeof

Consider:

In [31]: sum(getsizeof(record)+ getsizeof(record.name) + getsizeof(record.age)  for record in records)
Out[31]: 5220000000

In [32]: _ + getsizeof(records)
Out[32]: 5517842208

In [33]: _ * 1e-9
Out[33]: 5.517842208

So 5 gigs is an upper limit that is quite conservative. For example, it assumes there is no small-int caching going on, which for a record type holding ages will absolutely matter. On my own system, the Python process registers 2.7 gigs of memory usage (via top).
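A quick way to see the small-int caching mentioned above (CPython-specific behaviour; roughly the range -5 to 256 is cached):

import sys

a, b = 25, 25
print(a is b)             # True: both names point at the one cached int object
print(sys.getsizeof(25))  # ~28 bytes on 64-bit CPython, but the object exists only once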

So, what is actually going on in my machine is better modeled by being conservative for strings -- assuming unique strings with an average size of 10, so no string interning -- but liberal for ints, assuming int caching takes care of the int objects for us, so we only have to worry about the 8-byte pointers!

In [35]: sum(getsizeof("0123456789") + 8  for record in records)
Out[35]: 2412000000

In [36]: _ + getsizeof(records)
Out[36]: 2709842208

In [37]: _ * 1e-9
Out[37]: 2.709842208

Which is a good model for what I am observing from top.

Now, if you really want to cram data into RAM, you are going to have to lose some of Python's flexibility. You could use the array module in combination with struct to get C-like memory efficiency. An easier world to wade into might be numpy, which allows for similar things. For example:

In [1]: import numpy as np

In [2]: recordtype = np.dtype([('name', 'S20'),('age', np.uint8)])

In [3]: records = np.empty((36000000), dtype=recordtype)

In [4]: records.nbytes
Out[4]: 756000000

In [5]: records.nbytes*1e-9
Out[5]: 0.756

Note that we are now allowed to be quite compact. I can use an 8-bit unsigned integer (i.e. a single byte) to represent age. However, I am immediately faced with some inflexibility: if I want efficient storage of strings, I must define a maximum size. I've used 'S20', which is 20 characters. These are ASCII bytes, but a field of 20 ASCII characters may very well suffice for names.
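One consequence worth knowing: numpy silently truncates anything longer than the declared field width, so a quick check like this (reusing the recordtype dtype defined above) is worthwhile:

r = np.zeros(1, dtype=recordtype)
r['name'][0] = b"a-very-long-name-that-will-not-fit"
print(r['name'][0])  # clipped to the first 20 bytes, with no warning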

Now, numpy gives you a lot of fast methods wrapping C-compiled code. So, just to play around with it, let's fill our records with some toy data. Names will simply be strings of digits from a simple count, and ages will be drawn from a normal distribution with a mean of 50 and a standard deviation of 10.

In [8]: for i in range(1, 36000000+1):
   ...:     records['name'][i - 1] = b"%08d" % i
   ...:

In [9]: import random
   ...: for i in range(36000000):
   ...:     records['age'][i] = max(0, int(random.normalvariate(50, 10)))
   ...:

Now we can use numpy to query our records. For example, if you want the indices of the records that satisfy some condition, use np.where:

In [10]: np.where(records['age'] > 70)
Out[10]: (array([      58,      146,      192, ..., 35999635, 35999768, 35999927]),)

In [11]: idx = np.where(records['age'] > 70)[0]

In [12]: len(idx)
Out[12]: 643403

So there are 643403 records with an age > 70. Now, let's try 100:

In [13]: idx = np.where(records['age'] > 100)[0]

In [14]: len(idx)
Out[14]: 9

In [15]: idx
Out[15]:
array([ 2315458,  5088296,  5161049,  7079762, 15574072, 17995993,
       25665975, 26724665, 28322943])

In [16]: records[idx]
Out[16]:
array([(b'02315459', 101), (b'05088297', 102), (b'05161050', 101),
       (b'07079763', 104), (b'15574073', 101), (b'17995994', 102),
       (b'25665976', 101), (b'26724666', 102), (b'28322944', 101)],
      dtype=[('name', 'S20'), ('age', 'u1')])

Of course, one major inflexibility is that numpy arrays are fixed in size, and resize operations are expensive. Now, you could wrap a numpy.array in some class and it would act as an efficient backbone, but at that point you might as well use a real database. Lucky for you, Python comes with sqlite.
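A minimal sketch of that route with the standard-library sqlite3 module (the table and column names are just illustrative, and the toy rows stand in for the real 35 million records, which you would stream in instead):

import sqlite3

conn = sqlite3.connect(":memory:")  # or a filename, to keep the data out of RAM entirely
conn.execute("CREATE TABLE records (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    (("%08d" % i, 50 + i % 30) for i in range(1000)),  # toy rows for illustration
)
conn.commit()

# The same kind of query as np.where(records['age'] > 70), pushed down to the database.
rows = conn.execute("SELECT rowid, name, age FROM records WHERE age > ?", (70,)).fetchall()
print(len(rows))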
