Many dictionaries using massive amounts of RAM

Question
I have a very simple Python script that creates (for test purposes) 35 million dictionary objects within a list. Each dictionary object contains two key/value pairs, e.g.
{'Name': 'Jordan', 'Age': 35}
The script simply takes a query on name and age, searches through the list of dictionaries, and returns a new list containing the indices of all matching dictionary entries.
However, as you can see below, an insane amount of memory is consumed. I presume I am making a very naive mistake somewhere.
My code is as follows:
import sys

# Firstly, we will create 35 million records in memory, all will be the same apart from one

def search(key, value, data, age):
    print("Searching, please wait")
    # Create list to store returned PKs
    foundPKS = []
    for index in range(0, len(data)):
        if key in data[index] and 'Age' in data[index]:
            if data[index][key] == value and data[index]['Age'] >= age:
                foundPKS.append(index)
    results = foundPKS
    return results

def createdata():
    # Let's create our list for storing our dictionaries
    print("Creating database, please wait")
    dictList = []
    for index in range(0, 35000000):
        # Define dictionary
        record = {'Name': 'Jordan', 'Age': 25}
        if 24500123 <= index <= 24500200:
            record['Name'] = 'Chris'
            record['Age'] = 33
        # Add the dict to a list
        dictList.append(record)
    return dictList

datareturned = createdata()

keyname = input("For which key do you wish to search?")
valuename = input("Which values do you want to find?")
valueage = input("What is the minimum age?")

print("Full data set object size:" + str(sys.getsizeof(datareturned)))
results = search(keyname, valuename, datareturned, int(valueage))

if len(results) > 0:
    print(str(len(results)) + " found. Writing to results.txt")
    fo = open("results.txt", "w")
    for line in range(0, len(results)):
        fo.write(str(results[line]) + "\n")
    fo.close()
What is causing the massive consumption of RAM?
Answer
The overhead for a dict object is quite large. It depends on your Python version and your system architecture, but on Python 3.5 64-bit:
In [21]: sys.getsizeof({})
Out[21]: 288
So, taking ~250 bytes per dict as a conservative round figure:

250*36e6*1e-9 == 9.0

That is a lower limit on my RAM usage in gigabytes if I created that many dictionaries, not factoring in the list!
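(An aside not in the original answer: sys.getsizeof is shallow, so the per-dict figure above counts only the dict container itself, not its keys and values. A minimal sketch of a fuller per-record estimate, with exact numbers varying by Python version and platform:)

import sys

record = {'Name': 'Jordan', 'Age': 25}
# The dict container alone; roughly 288 bytes on 64-bit Python 3.5
print(sys.getsizeof(record))
# A fuller (still rough) per-record figure also counts keys and values,
# though in practice CPython shares the key strings and small ints:
total = sys.getsizeof(record) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in record.items())
print(total)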
Rather than use a dict as a record type, which isn't really its use case, use a namedtuple.
And to get a view of how this compares, let's set up an equivalent list of tuples:
In [22]: from collections import namedtuple
In [23]: Record = namedtuple("Record", "name age")
In [24]: records = [Record("john", 28) for _ in range(36000000)]
In [25]: getsizeof = sys.getsizeof
Consider:
In [31]: sum(getsizeof(record)+ getsizeof(record.name) + getsizeof(record.age) for record in records)
Out[31]: 5220000000
In [32]: _ + getsizeof(records)
Out[32]: 5517842208
In [33]: _ * 1e-9
Out[33]: 5.517842208
So 5 gigs is an upper limit that is quite conservative. For example, it assumes that there is no small-int caching going on, which for a record type holding ages will totally matter. On my own system, the python process is registering 2.7 gigs of memory usage (via top).
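(To see the small-int caching referred to here, a quick illustration; this is a CPython implementation detail, not something the language guarantees:)

# CPython keeps one shared object for small ints (roughly -5 to 256),
# so 36 million age values of 25 can all point at the same object.
a = 25
b = 20 + 5
print(a is b)  # True in CPython: both names reference the cached 25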
So, what is actually going on in my machine is better modeled by being conservative for strings (assuming unique strings with an average size of 10, so no string interning) but liberal for ints (assuming int-caching takes care of our int objects for us, so we only have to worry about the 8-byte pointers!).
In [35]: sum(getsizeof("0123456789") + 8 for record in records)
Out[35]: 2412000000
In [36]: _ + getsizeof(records)
Out[36]: 2709842208
In [37]: _ * 1e-9
Out[37]: 2.709842208
Which is a good model for what I'm observing from top.
Now, if you really want to cram data into RAM, you are going to have to lose the flexibility of Python. You could use the array module in combination with struct to get C-like memory efficiency (a rough sketch of that approach follows the numpy example below). An easier world to wade into might be numpy instead, which allows for similar things. For example:
In [1]: import numpy as np
In [2]: recordtype = np.dtype([('name', 'S20'),('age', np.uint8)])
In [3]: records = np.empty((36000000), dtype=recordtype)
In [4]: records.nbytes
Out[4]: 756000000
In [5]: records.nbytes*1e-9
Out[5]: 0.756
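(As promised above, the array/struct route might look roughly like this; a minimal sketch, not from the original answer, using a 21-byte layout that mirrors the numpy dtype:)

import struct

# One record = a 20-byte name field plus one unsigned byte for age.
record_fmt = struct.Struct("20sB")
n = 1000000  # a smaller toy count, just to illustrate
buf = bytearray(record_fmt.size * n)

record_fmt.pack_into(buf, 0, b"john", 28)   # write record 0
name, age = record_fmt.unpack_from(buf, 0)  # read it back
print(name.rstrip(b"\x00"), age)            # b'john' 28
print(record_fmt.size * 36000000 * 1e-9)    # ~0.756 GB for 36 million records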
Note, we are now allowed to be quite compact. I can use 8-bit unsigned integers (i.e. a single byte) to represent age. However, I am immediately faced with some inflexibility: if I want efficient storage of strings, I must define a maximum size. I've used 'S20', which is 20 characters. These are ASCII bytes, but a field of 20 ASCII characters might very well suffice for names.
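(One consequence worth knowing, shown in a quick sketch below: numpy silently truncates anything longer than the fixed field width.)

# Assigning a name longer than 20 bytes to an 'S20' field clips it silently:
records['name'][0] = b"a-name-much-longer-than-twenty-bytes"
print(records['name'][0])       # b'a-name-much-longer-t'
print(len(records['name'][0]))  # 20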
Now, numpy gives you a lot of fast methods wrapping C-compiled code. So, just to play around with it, let's fill our records with some toy data. Names will simply be strings of digits from a simple count, and ages will be selected from a normal distribution with a mean of 50 and a standard deviation of 10.
In [8]: for i in range(1, 36000000+1):
...: records['name'][i - 1] = b"%08d" % i
...:
In [9]: import random
...: for i in range(36000000):
...: records['age'][i] = max(0, int(random.normalvariate(50, 10)))
...:
Now, we can use numpy to query our records. For example, if you want the indices of the records matching some condition, use np.where:
In [10]: np.where(records['age'] > 70)
Out[10]: (array([ 58, 146, 192, ..., 35999635, 35999768, 35999927]),)
In [11]: idx = np.where(records['age'] > 70)[0]
In [12]: len(idx)
Out[12]: 643403
So 643403 records have an age > 70. Now, let's try 100:
In [13]: idx = np.where(records['age'] > 100)[0]
In [14]: len(idx)
Out[14]: 9
In [15]: idx
Out[15]:
array([ 2315458, 5088296, 5161049, 7079762, 15574072, 17995993,
25665975, 26724665, 28322943])
In [16]: records[idx]
Out[16]:
array([(b'02315459', 101), (b'05088297', 102), (b'05161050', 101),
(b'07079763', 104), (b'15574073', 101), (b'17995994', 102),
(b'25665976', 101), (b'26724666', 102), (b'28322944', 101)],
dtype=[('name', 'S20'), ('age', 'u1')])
Of course, one major inflexibility is that numpy arrays are fixed in size, and resizing operations are expensive. Now, you could maybe wrap a numpy.array in some class and it will act as an efficient backbone, but at that point, you might as well use a real database. Luckily for you, Python comes with sqlite.
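(To round this off with a sketch that is not in the original answer: the same name/age query against sqlite, via the sqlite3 module in the standard library. The file, table, and column names here are made up for illustration:)

import sqlite3

conn = sqlite3.connect("records.db")  # or ":memory:" to keep it in RAM
conn.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    (("%08d" % i, 25) for i in range(1000)))  # a small toy sample
conn.commit()

# An index keeps the age filter cheap without holding the data in Python:
conn.execute("CREATE INDEX IF NOT EXISTS idx_age ON records (age)")
rows = conn.execute(
    "SELECT rowid, name, age FROM records WHERE name = ? AND age >= ?",
    ("00000500", 25)).fetchall()
print(len(rows))
conn.close()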