Huge memory usage of loading large dictionaries in memory


Question


I have a file on disk that's only 168MB. It's just a comma-separated list of word,id. The word can be 1-5 characters long. There are 6.5 million lines.

I created a dictionary in Python to load this up into memory so I can search incoming text against that list of words. When Python loads it up into memory it shows 1.3 GB of RAM used. Any idea why that is?

So let's say my word file looks like this...

1,word1
2,word2
3,word3

Then add 6.5 million more lines like that. I then loop through that file and create a dictionary (Python 2.6.1):

import csv
import os

cached_terms = {}   # module-level cache filled by load_term_cache()

def load_term_cache():
    """will load the term cache from our cached file instead of hitting mysql. If it didn't
    preload into memory it would be 20+ million queries per process"""
    global cached_terms
    dumpfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')
    f = open(dumpfile)
    cache = csv.reader(f)
    for term_id, term in cache:
        cached_terms[term] = term_id
    f.close()

Just doing that blows up the memory. I view Activity Monitor and it pegs the memory to all available, up to around 1.5 GB of RAM. On my laptop it just starts to swap. Any ideas how to most efficiently store key/value pairs in memory with Python?

Update: I tried to use the anydbm module, and after 4.4 million records it just dies. The floating-point number is the elapsed seconds since I started the load:

56.95
3400018
60.12
3600019
63.27
3800020
66.43
4000021
69.59
4200022
72.75
4400023
83.42
4600024
168.61
4800025
338.57

You can see it was running great: 200,000 rows inserted every few seconds, until I hit a wall and the time doubled.

import anydbm
import os
import time

i=0
mark=0
starttime = time.time()
dbfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms')
db = anydbm.open(dbfile, 'c')
#load from existing baseterm file
termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt.LARGE')
for line in open(termfile):
    i += 1
    pieces = line.split(',')
    db[str(pieces[1])] = str(pieces[0])
    if i > mark:
        print i
        print round(time.time() - starttime, 2)
        mark = i + 200000
db.close()

Solution

Lots of ideas. However, if you want practical help, edit your question to show ALL of your code. Also tell us what the "it" is that shows memory used, what it shows when you load a file with zero entries, what platform you are on, and what version of Python.

You say that "the word can be 1-5 characters long". What is the average length of the key field in BYTES? Are the ids all integers? If so, what are the min and max integers? If not, what is the average length of the IDs in bytes? To enable cross-checking of all of the above, how many bytes are there in your 6.5M-line file?
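
For reference, a rough sketch that would collect those numbers (assuming the id,word layout and the baseterms.txt path shown in the question; adjust for your own setup):

import os

# Stats pass over the cache file: total bytes, line count, average word length, id range.
path = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')
total_bytes = os.path.getsize(path)
word_bytes = 0
lines = 0
min_id = max_id = None
for line in open(path, 'rb'):
    lines += 1
    term_id, term = line.rstrip('\r\n').split(',', 1)
    word_bytes += len(term)
    n = int(term_id)                  # raises ValueError if an id is not an integer
    if min_id is None or n < min_id:
        min_id = n
    if max_id is None or n > max_id:
        max_id = n

print "file bytes:", total_bytes
print "lines:", lines
print "average word length (bytes):", float(word_bytes) / lines
print "id range:", min_id, "to", max_id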

Looking at your code, a 1-line file word1,1 will create a dict d['1'] = 'word1' ... isn't that bassackwards?

Update 3: More questions: How is the "word" encoded? Are you sure you are not carrying a load of trailing spaces in either of the two fields?

Update 4 ... You asked "how to most efficiently store key/value pairs in memory with python" and nobody's answered that yet with any accuracy.

You have a 168 MB file with 6.5 million lines. That's 168 * 1.024 ** 2 / 6.5 = 27.1 bytes per line. Knock off 1 byte for the comma and 1 byte for the newline (assuming it's a *x platform) and we're left with 25 bytes per line. Assuming the "id" is intended to be unique, and as it appears to be an integer, let's assume the "id" is 7 bytes long; that leaves us with an average size of 18 bytes for the "word". Does that match your expectation?

So, we want to store an 18-byte key and a 7-byte value in an in-memory look-up table.

Let's assume a 32-bit CPython 2.6 platform.

>>> import sys
>>> K = sys.getsizeof('123456789012345678')
>>> V = sys.getsizeof('1234567')
>>> K, V
(42, 31)

Note that sys.getsizeof(str_object) => 24 + len(str_object)

Tuples were mentioned by one answerer. Note carefully the following:

>>> sys.getsizeof(())
28
>>> sys.getsizeof((1,))
32
>>> sys.getsizeof((1,2))
36
>>> sys.getsizeof((1,2,3))
40
>>> sys.getsizeof(("foo", "bar"))
36
>>> sys.getsizeof(("fooooooooooooooooooooooo", "bar"))
36
>>>

Conclusion: sys.getsizeof(tuple_object) => 28 + 4 * len(tuple_object) ... it only allows for a pointer to each item, it doesn't allow for the sizes of the items.

A similar analysis of lists shows that sys.getsizeof(list_object) => 36 + 4 * len(list_object) ... again it is necessary to add the sizes of the items. There is a further consideration: CPython overallocates lists so that it doesn't have to call the system realloc() on every list.append() call. For sufficiently large size (like 6.5 million!) the overallocation is 12.5 percent -- see the source (Objects/listobject.c). This overallocation is not done with tuples (their size doesn't change).
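
A quick way to watch that overallocation happen (a rough sketch; exact numbers depend on the build, assuming the same 32-bit CPython as above):

import sys

lst = []
last = sys.getsizeof(lst)
for i in range(32):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:              # the reported size only jumps when the list reallocates
        print len(lst), size
        last = size

print "tuple of the same items:", sys.getsizeof(tuple(lst))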

Here are the costs of various alternatives to dict for a memory-based look-up table:

List of tuples:

Each tuple will take 36 bytes for the 2-tuple itself, plus K and V for the contents. So N of them will take N * (36 + K + V); then you need a list to hold them, so we need 36 + 1.125 * 4 * N for that.

Total for list of tuples: 36 + N * (40.5 + K + V)

That's 36 + 113.5 * N (about 709 MB when N is 6.5 million)

Two parallel lists:

(36 + 1.125 * 4 * N + K * N) + (36 + 1.125 * 4 * N + V * N) i.e. 72 + N * (9 + K + V)

Note that the difference between 40.5 * N and 9 * N is about 200MB when N is 6.5 million.

Value stored as int not str:

But that's not all. If the IDs are actually integers, we can store them as such.

>>> sys.getsizeof(1234567)
12

That's 12 bytes instead of 31 bytes for each value object. That difference of 19 * N is a further saving of about 118MB when N is 6.5 million.

Use array.array('l') instead of list for the (integer) value:

We can store those 7-digit integers in an array.array('l'). No int objects, and no pointers to them -- just a 4-byte signed integer value. Bonus: arrays are overallocated by only 6.25% (for large N). So that's 1.0625 * 4 * N instead of the previous (1.125 * 4 + 12) * N, a further saving of 12.25 * N i.e. 76 MB.
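
Putting those pieces together, here is a rough sketch of the "two parallel sequences" table (assuming the id,word file layout from the question and ids that fit in a C long):

import array
import bisect

def load_table(path):
    # Build a list of words sorted once, plus a parallel array.array('l') of ids.
    pairs = []
    for line in open(path, 'rb'):
        term_id, term = line.rstrip('\r\n').split(',', 1)
        pairs.append((term, int(term_id)))
    pairs.sort()
    words = [term for term, term_id in pairs]
    ids = array.array('l', [term_id for term, term_id in pairs])
    return words, ids

def lookup(words, ids, term):
    # Binary search the sorted word list, then read the matching id.
    i = bisect.bisect_left(words, term)
    if i < len(words) and words[i] == term:
        return ids[i]
    return None

(Building via a temporary list of tuples still peaks near the list-of-tuples figure during the load; a two-pass read, or sorting an index array, would avoid that at the cost of more code.)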

So we're down to 709 - 200 - 118 - 76 = about 315 MB.
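
For anyone who wants to re-run that arithmetic, a small check using the 32-bit sizes measured above (expect a few MB of rounding drift from the figures quoted in the text):

N = 6500000
K, V = 42, 31                  # getsizeof of an 18-byte key str and a 7-byte value str
MB = 1024.0 ** 2

list_of_tuples = 36 + N * (36 + 1.125 * 4 + K + V)
parallel_lists = 72 + N * (2 * 1.125 * 4 + K + V)
int_values     = 72 + N * (2 * 1.125 * 4 + K + 12)
array_values   = 36 + N * (1.125 * 4 + K) + N * 1.0625 * 4

for name, cost in [('list of tuples', list_of_tuples),
                   ('two parallel lists', parallel_lists),
                   ('values as int', int_values),
                   ('ids in array.array', array_values)]:
    print name, int(round(cost / MB)), 'MB'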

N.B. Errors and omissions excepted -- it's 0127 in my TZ :-(
