Huge memory usage of loading large dictionaries in memory

Question

I have a file on disk that's only 168MB. It's just a comma separated list of word,id. The word can be 1-5 characters long. There's 6.5 million lines.

I created a dictionary in python to load this up into memory so I can search incoming text against that list of words. When python loads it up into memory it shows 1.3 GB's of RAM space used. Any idea why that is?

So let's say my word file looks like this...

1,word1
2,word2
3,word3

Then add 6.5 million to that. I then loop through that file and create a dictionary (python 2.6.1):

import csv
import os

cached_terms = {}

def load_term_cache():
    """will load the term cache from our cached file instead of hitting mysql. If it didn't
    preload into memory it would be 20+ million queries per process"""
    global cached_terms
    dumpfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')
    f = open(dumpfile)
    cache = csv.reader(f)
    for term_id, term in cache:
        cached_terms[term] = term_id
    f.close()

Just doing that blows up the memory. I view activity monitor and it pegs the memory to all available up to around 1.5GB of RAM On my laptop it just starts to swap. Any ideas how to most efficiently store key/value pairs in memory with python?

Update: I tried the anydbm module, and after 4.4 million records it just dies; the floating point number is the elapsed seconds since I started loading it:

56.95
3400018
60.12
3600019
63.27
3800020
66.43
4000021
69.59
4200022
72.75
4400023
83.42
4600024
168.61
4800025
338.57

You can see it was running great: 200,000 rows inserted every few seconds, until I hit a wall and the time doubled.

import anydbm
import os
import time

i = 0
mark = 0
starttime = time.time()
dbfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms')
db = anydbm.open(dbfile, 'c')
#load from existing baseterm file
termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt.LARGE')
for line in open(termfile):
    i += 1
    pieces = line.split(',')
    db[str(pieces[1])] = str(pieces[0])
    if i > mark:
        print i
        print round(time.time() - starttime, 2)
        mark = i + 200000
db.close()

Answer

Lots of ideas. However, if you want practical help, edit your question to show ALL of your code. Also tell us what the "it" is that shows memory used, what it shows when you load a file with zero entries, what platform you are on, and what version of Python you are using.

You say that "the word can be 1-5 characters long". What is the average length of the key field in BYTES? Are the ids all integers? If so, what are the min and max integers? If not, what is the average length of an ID in bytes? To enable cross-checking of all of the above, how many bytes are there in your 6.5M-line file?
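
For what it's worth, those numbers can be gathered with a short script along these lines (a sketch only: it assumes the id,word layout shown in the sample above and the baseterms.txt path from the question, and Python 2 to match the question's environment; swap the two fields if the file really is word,id):

import os

termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')

line_count = 0
word_bytes = 0
ids_all_int = True
min_id = max_id = None

for line in open(termfile, 'rb'):
    line_count += 1
    term_id, term = line.rstrip('\r\n').split(',', 1)
    word_bytes += len(term)
    try:
        n = int(term_id)
    except ValueError:
        ids_all_int = False
        continue
    if min_id is None or n < min_id:
        min_id = n
    if max_id is None or n > max_id:
        max_id = n

print 'file size in bytes:', os.path.getsize(termfile)
print 'lines:', line_count
print 'average word length in bytes:', float(word_bytes) / line_count
print 'ids all integers:', ids_all_int, 'min:', min_id, 'max:', max_id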

Looking at your code, a 1-line file word1,1 will create a dict d['1'] = 'word1' ... isn't that bassackwards?

Update 3: More questions: How is the "word" encoded? Are you sure you are not carrying a load of trailing spaces on either of the two fields?
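
A quick way to check for that is a pass like the one below (again just a sketch against the same hypothetical path, in Python 2). Note, for example, that the anydbm loop above splits on ',' without stripping the line, so pieces[1] still carries its trailing newline.

import os

termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')

padded = 0
non_ascii = 0
for line in open(termfile, 'rb'):
    term_id, term = line.rstrip('\r\n').split(',', 1)
    if term != term.strip() or term_id != term_id.strip():
        padded += 1
    if any(ord(c) > 127 for c in term):
        non_ascii += 1

print 'rows with leading/trailing whitespace:', padded
print 'rows with non-ASCII bytes in the word:', non_ascii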

Update 4 ... You asked "how to most efficiently store key/value pairs in memory with python" and nobody's answered that yet with any accuracy.

You have a 168 Mb file with 6.5 million lines. That's 168 * 1.024 ** 2 / 6.5 = 27.1 bytes per line. Knock off 1 byte for the comma and 1 byte for the newline (assuming it's a *x platform) and we're left with 25 bytes per line. Assuming the "id" is intended to be unique, and as it appears to be an integer, let's assume the "id" is 7 bytes long; that leaves us with an average size of 18 bytes for the "word". Does that match your expectation?

So, we want to store an 18-byte key and a 7-byte value in an in-memory look-up table.

Let's assume a 32-bit CPython 2.6 platform.

>>> import sys
>>> K = sys.getsizeof('123456789012345678')
>>> V = sys.getsizeof('1234567')
>>> K, V
(42, 31)

Note that sys.getsizeof(str_object) => 24 + len(str_object)

Tuples were mentioned by one answerer. Note carefully the following:

>>> sys.getsizeof(())
28
>>> sys.getsizeof((1,))
32
>>> sys.getsizeof((1,2))
36
>>> sys.getsizeof((1,2,3))
40
>>> sys.getsizeof(("foo", "bar"))
36
>>> sys.getsizeof(("fooooooooooooooooooooooo", "bar"))
36
>>>

Conclusion: sys.getsizeof(tuple_object) => 28 + 4 * len(tuple_object) ... it only allows for a pointer to each item; it doesn't include the sizes of the items themselves.

A similar analysis of lists shows that sys.getsizeof(list_object) => 36 + 4 * len(list_object) ... again it is necessary to add the sizes of the items. There is a further consideration: CPython overallocates lists so that it doesn't have to call the system realloc() on every list.append() call. For sufficiently large size (like 6.5 million!) the overallocation is 12.5 percent -- see the source (Objects/listobject.c). This overallocation is not done with tuples (their size doesn't change).
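
The overallocation is easy to watch with getsizeof (an illustrative snippet; the exact byte counts depend on the build, but the step pattern, and the lack of spare capacity in the tuple, is the point):

import sys

items = []
previous = sys.getsizeof(items)
print 0, previous
for i in range(50):
    items.append(i)
    size = sys.getsizeof(items)
    if size != previous:
        # the list just grew its capacity, reserving room for future appends
        print len(items), size
        previous = size

# a tuple of the same length is sized exactly, with no spare capacity
print len(items), sys.getsizeof(tuple(items))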

Here are the costs of various alternatives to dict for a memory-based look-up table:

List of tuples:

Each tuple will take 36 bytes for the 2-tuple itself, plus K and V for the contents. So N of them will take N * (36 + K + V); then you need a list to hold them, so we need 36 + 1.125 * 4 * N for that.

Total for list of tuples: 36 + N * (40.5 + K + V)

That's 36 + 113.5 * N (about 709 MB when N is 6.5 million)

Two parallel lists:

(36 + 1.125 * 4 * N + K * N) + (36 + 1.125 * 4 * N + V * N) i.e. 72 + N * (9 + K + V)

Note that the difference between 40.5 * N and 9 * N is about 200MB when N is 6.5 million.
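
The answer doesn't spell out how lookups would work against two parallel lists; one workable pattern (an assumption on my part, not something prescribed above) is to sort the words once and binary-search them with the bisect module, keeping the ids in a second list at matching positions:

import bisect

# tiny stand-in for the 6.5 million (id, word) rows in baseterms.txt
rows = [('1', 'word1'), ('2', 'word2'), ('3', 'word3')]

# build once: words sorted, ids kept in a parallel list at the same indexes
pairs = sorted((term, term_id) for term_id, term in rows)
words = [p[0] for p in pairs]
ids = [p[1] for p in pairs]

def lookup(word):
    # binary search on the sorted word list: O(log N) per probe instead of a dict's O(1)
    i = bisect.bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return ids[i]
    return None

print lookup('word2')    # -> '2'
print lookup('missing')  # -> None

The trade-off is O(log N) probes instead of the dict's O(1) hashing, in exchange for the roughly 200MB saving estimated above.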

Value stored as int not str:

But that's not all. If the IDs are actually integers, we can store them as such.

>>> sys.getsizeof(1234567)
12

That's 12 bytes instead of 31 bytes for each value object. That difference of 19 * N is a further saving of about 118MB when N is 6.5 million.

Use array.array('l') instead of list for the (integer) value:

We can store those 7-digit integers in an array.array('l'). No int objects, and no pointers to them -- just a 4-byte signed integer value. Bonus: arrays are overallocated by only 6.25% (for large N). So that's 1.0625 * 4 * N instead of the previous (1.125 * 4 + 12) * N, a further saving of 12.25 * N i.e. 76 MB.
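
As a tiny illustration of that last step (hypothetical, just to show the mechanics): the plain Python list of id values in the sketch above can be swapped for a typed array, so each id is stored as a raw machine integer rather than an int object plus a pointer:

import array

# stand-in for the 6.5M ids; in practice they would be appended while reading the file
ids = array.array('l', [1, 2, 3, 4000000])

print ids[3]         # ordinary Python ints come back out on access
print ids.itemsize   # bytes per stored value: 4 on the 32-bit build assumed above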

So we're down to 709 - 200 - 118 - 76 = about 315 MB.
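
Putting the arithmetic in one place (a back-of-envelope script using the 32-bit getsizeof figures measured earlier, not a measurement of real data; the intermediate totals land a few MB off the rounded figures above but end at the same ~315 MB):

N = 6.5e6    # rows
K = 42       # sys.getsizeof of an 18-byte key string
V = 31       # sys.getsizeof of a 7-byte value string
MB = 1024.0 ** 2

list_of_tuples = 36 + N * (40.5 + K + V)
parallel_lists = 72 + N * (9 + K + V)
int_values = parallel_lists - N * (31 - 12)                      # int objects instead of value strings
array_values = int_values - N * (1.125 * 4 + 12 - 1.0625 * 4)    # array.array('l') instead of a list of ints

for name, size in [('list of tuples', list_of_tuples),
                   ('two parallel lists', parallel_lists),
                   ('values as ints', int_values),
                   ('values in array.array', array_values)]:
    print '%-22s %4.0f MB' % (name, size / MB)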

N.B. Errors and omissions excepted -- it's 0127 in my TZ :-(
