Why does a dictionary use so much RAM in Python


Problem description

I have written a Python script that reads the contents of two files, the first a relatively small file (~30 KB) and the second a larger file (~270 MB). The contents of both files are loaded into a dictionary data structure. When the second file is loaded I would have expected the amount of RAM required to be roughly equivalent to the size of the file on disk, perhaps with some overhead, but watching the RAM usage on my PC it seems to consistently take ~2 GB (around 8 times the size of the file). The relevant source code is below (pauses inserted just so I can see the RAM usage at each stage). The line consuming large amounts of memory is "tweets = map(json.loads, tweet_file)":

import sys
import json

scores = {}  # global term -> score dictionary, filled in by get_scores()

def get_scores(term_file):
    global scores
    for line in term_file:
        term, score = line.split("\t") # tab character
        scores[term] = int(score)

def pause():
    tmp = raw_input('press any key to continue: ')

def main():
    # get terms and their scores..
    print 'open word list file ...'
    term_file = open(sys.argv[1])
    pause()
    print 'create dictionary from word list file ...'
    get_scores(term_file)
    pause()
    print 'close word list file ...'
    term_file.close()  # note the (): term_file.close alone does not close the file
    pause()

    # get tweets from file...
    print 'open tweets file ...'
    tweet_file = open(sys.argv[2])
    pause()
    print 'create list of dictionaries from tweets file ...'
    tweets = map(json.loads, tweet_file) #creates a list of dictionaries (one per tweet)
    pause()
    print 'close tweets file ...'
    tweet_file.close()
    pause()

if __name__ == '__main__':
    main()
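
For reference, the pauses could also be replaced by direct measurements from inside the script (just a sketch, not part of the script above; it assumes a Unix-like system, where ru_maxrss is reported in kilobytes on Linux and in bytes on macOS):

import resource

def report(stage):
    # peak resident set size of this process so far (kilobytes on Linux)
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print '%s: peak RSS so far = %d KB' % (stage, peak)

# e.g. report('after map(json.loads, ...)') wherever pause() is called now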

Does anyone know why this is? My concern is that I would like to extend my research to larger files, but will quickly run out of memory. Interestingly, the memory usage does not seem to increase noticeably after opening the file (as I think this just creates a file pointer).

I have an idea to try looping through the file one line at a time, processing what I can and storing only the minimum that I need for future reference rather than loading everything into a list of dictionaries, but I was just interested to see whether the roughly 8x multiplier from file size to memory when creating a dictionary is in line with other people's experience?
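
A minimal sketch of that line-by-line idea (the 'text' field and the use of the global scores dictionary are assumptions for illustration, not code I have written yet):

import json

def process_tweets(path):
    total = 0.0
    with open(path) as tweet_file:
        for line in tweet_file:           # one JSON object per line
            tweet = json.loads(line)      # only this tweet is held in memory
            text = tweet.get('text', '')  # hypothetical field of interest
            total += sum(scores.get(word, 0) for word in text.split())
            # 'tweet' is rebound on the next iteration, so the previous
            # dictionary becomes garbage and its memory can be reused
    return total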

Solution

My guess is that you have multiple copies of your dictionary stored in memory simultaneously (in various formats). As an example, the line:

tweets = map(json.loads, tweet_file) #creates a list of dictionaries (one per tweet)

will create a new copy of the data (+400 to ~1000 MB, including the dictionary overhead). But your original tweet_file stays in memory. Why such big numbers? Well, if you work with Unicode strings, each Unicode character uses 2 or 4 bytes in memory, whereas in your file, assuming UTF-8 encoding, most of the characters use only 1 byte. If you are working with plain strings in Python 2, the size of a string in memory should be almost the same as its size on disk. So you will have to find another explanation.
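
To get a feel for that overhead on your own data, you could compare one raw line with its parsed form (a rough sketch; 'tweets.json' is a placeholder for your second file, and the sum is only a shallow estimate):

import json
import sys

with open('tweets.json') as tweet_file:   # placeholder for sys.argv[2]
    line = tweet_file.readline()

tweet = json.loads(line)

print 'bytes in the raw line:         %d' % len(line)
print 'byte string object in memory:  %d' % sys.getsizeof(line)

# sys.getsizeof() of a dict counts only the hash table itself, not the keys
# and values it refers to, so add those for a (still shallow) estimate
shallow = sys.getsizeof(tweet) + sum(sys.getsizeof(k) + sys.getsizeof(v)
                                     for k, v in tweet.items())
print 'parsed dict, shallow estimate: %d' % shallow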

EDIT: The actual number of bytes occupied by a "character" in Python 2 may vary. Here are some examples:

>>> import sys
>>> sys.getsizeof("")
40
>>> sys.getsizeof("a")
41
>>> sys.getsizeof("ab")
42

As you see, it appears that each character is encoded as one byte. But:

>>> sys.getsizeof("à")
42

Not for "French" characters. And ...

>>> sys.getsizeof("世")
43
>>> sys.getsizeof("世界")
46

For Japanese, we have 3 bytes per character.

The above results are system dependent, and are explained by the fact that my system uses UTF-8 as its default encoding. The "size of the string" calculated just above is in fact the size of the byte string representing the given text.
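
A quick way to see that (again on a system whose terminal encoding is UTF-8) is to compare byte lengths with character counts:

>>> len("a"), len("à"), len("世")     # bytes in the encoded literals
(1, 2, 3)
>>> len(u"a"), len(u"à"), len(u"世")  # characters in the unicode strings
(1, 1, 1)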

If 'json.loads' uses "unicode" strings, the results are somewhat different:

>>> sys.getsizeof(u"")
52
>>> sys.getsizeof(u"a")
56
>>> sys.getsizeof(u"ab")
60
>>> sys.getsizeof(u"世")
56
>>> sys.getsizeof(u"世界")
60

In that case, as you can see, each extra character adds 4 extra bytes. Since json.loads returns unicode objects in Python 2, roughly 270 MB of mostly 1-byte-per-character text can already grow to around 4 x 270 MB, about 1 GB, once parsed (on a 4-bytes-per-character build like this one), before adding the per-string and per-dictionary object overhead shown above, so an overall factor of about 8 is not surprising.


Maybe the file object caches some data? If you want to trigger explicit deallocation of an object, try setting its reference to None:

tweets = map(json.loads, tweet_file) #creates a list of dictionaries (one per tweet)
[...]
tweet_file.close()
tweet_file = None

When there is no longer any reference to an object, Python will deallocate it, and so free the corresponding memory (within the Python heap; I don't think the memory is returned to the system).
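
If you want to be even more explicit, you can also drop the parsed list itself and ask the garbage collector to run right away (del and gc.collect() are standard Python, but this snippet is only a sketch, not something from the code above):

import gc

tweets = map(json.loads, tweet_file)  # list of dicts, one per tweet
# ... work with tweets ...
tweet_file.close()
tweet_file = None

del tweets     # remove the last reference to the parsed tweets
gc.collect()   # immediately collect any reference cycles as well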
