Understanding memory usage in python


Problem Description

I'm trying to understand how python is using memory to estimate how many processes I can run at a time. Right now I process large files on a server with large amounts of ram (~90-150GB of free RAM).

For a test, I would do things in python, then look at htop to see what the usage was.

step 1: I open a file which is 2.55GB and save it to a string

with open(file,'r') as f:
    data=f.read()

usage is 2686M

step 2: I split the file on newlines

data = data.split('\n')

usage is 7476M

step 3: I keep only every 4th line (two of the three lines I remove are of equal length to the line I keep)

data=[data[x] for x in range(0,len(data)) if x%4==1]

usage is 8543M

step 4: I split this into 20 equal chunks to run through a multiprocessing pool.

l = []
for b in range(0, len(data), len(data) // 20):
    l.append(data[b:b + (len(data) // 20)])

usage is 8621M

step 5: I delete data, usage is 8496M.

There are several things that are not making sense to me.

In step two, why does the memory usage go up so much when I change the string into an array? I am assuming that the array containers are much larger than the string container?
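
To put rough numbers on that assumption, here's a quick sketch (sizes are illustrative and vary across Python versions and builds):

import sys

s = "x" * 100                          # one string: one object header + 100 chars
parts = ["x" * 10 for _ in range(10)]  # the same 100 chars as 10 separate strings

print(sys.getsizeof(s))                # header + data, e.g. ~137 bytes on 64-bit CPython 2.7
print(sys.getsizeof(parts) +           # the list's own pointer array...
      sum(sys.getsizeof(p) for p in parts))  # ...plus 10 headers + data: several times larger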

In step three, why doesn't the data shrink significantly? I essentially got rid of 3/4 of my array and at least 2/3 of the data within the array. I would expect it to shrink accordingly. Calling the garbage collector did not make any difference.
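
(For the record, "calling the garbage collector" here means gc.collect(), which only reclaims unreachable reference cycles; it cannot compact live objects or release partly-used allocator arenas.)

import gc
gc.collect()   # reclaims cyclic garbage only; live strings stay where they are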

Oddly enough, when I assigned the smaller array to another variable, it used less memory: usage 6605M.

When I then deleted the old object data: usage 6059M.

This seems weird to me. Any help on shrinking my memory footprint would be appreciated.

EDIT

Okay, this is making my head hurt. Clearly python is doing some weird things behind the scenes here... and only python. I've made the following script to demonstrate this using my original method and the method suggested in the answer below. Numbers are all in GB.

Test code

import os, sys
import psutil
process = psutil.Process(os.getpid())

py_usage = process.memory_info().vms / 1000000000.0
in_file = "14982X16.fastq"

def totalsize(o):
    # shallow size of the container plus the shallow sizes of its items, in GB
    size = 0
    for x in o:
        size += sys.getsizeof(x)
    size += sys.getsizeof(o)
    return "Object size:" + str(size / 1000000000.0)

def getlines4(f):
    # stream the file, yielding only every 4th line
    for i, line in enumerate(f):
        if i % 4 == 1:
            yield line.rstrip()

def method1():
    # read the whole file into one string, split it, then filter
    with open(in_file, 'rb') as f:
        data = f.read().split("\n")
    data = [data[x] for x in xrange(0, len(data)) if x % 4 == 1]
    return data

def method2():
    # never materialize more than the lines of interest
    with open(in_file, 'rb') as f:
        data2 = list(getlines4(f))
    return data2


print "method1 == method2",method1()==method2()
print "Nothing in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data=method1()
print "data from method1 is in memory"
print "method1", totalsize(data)
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data
print "Nothing in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data2=method2()
print "data from method2 is in memory"
print "method2", totalsize(data2)
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data2
print "Nothing is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage


print "\nPrepare to have your mind blown even more!"
data=method1()
print "Data from method1 is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data2=method2()
print "Data from method1 and method 2 are in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data==data2
print "Compared the two lists"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data
print "Data from method2 is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data2
print "Nothing is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage

Output

method1 == method2 True
Nothing in memory
Usage: 0.001798144
data from method1 is in memory
method1 Object size:1.52604683
Usage: 4.552925184
Nothing in memory
Usage: 0.001798144
data from method2 is in memory
method2 Object size:1.534815518
Usage: 1.56932096
Nothing is in memory
Usage: 0.001798144

Prepare to have your mind blown even more!
Data from method1 is in memory
Usage: 4.552925184
Data from method1 and method 2 are in memory
Usage: 4.692287488
Compared the two lists
Usage: 4.692287488
Data from method2 is in memory
Usage: 4.56169472
Nothing is in memory
Usage: 0.001798144

For those of you using python3, it's pretty similar, except not as bad after the comparison operation...

Output for PYTHON3

method1 == method2 True
Nothing in memory
Usage: 0.004395008000000006
data from method1 is in memory
method1 Object size:1.718523294
Usage: 5.322555392
Nothing in memory
Usage: 0.004395008000000006
data from method2 is in memory
method2 Object size:1.727291982
Usage: 1.872596992
Nothing is in memory
Usage: 0.004395008000000006

Prepare to have your mind blown even more!
Data from method1 is in memory
Usage: 5.322555392
Data from method1 and method 2 are in memory
Usage: 5.461917696
Compared the two lists
Usage: 5.461917696
Data from method2 is in memory
Usage: 2.747633664
Nothing is in memory
Usage: 0.004395008000000006

Moral of the story... memory for python appears to be a bit like Camelot for Monty Python... 'tis a very silly place.

Answer

I'm going to suggest that you back off and approach this instead in a way that directly addresses your goal: shrinking peak memory use to begin with. No amount of analysis & fiddling later can overcome using a doomed approach to begin with ;-)

Concretely, you got off on a wrong foot at the first step, via data=f.read(). Now it's already the case that your program can't possibly scale beyond a data file that fits entirely in RAM with room to spare (to run the OS and Python and ...) too.

Do you actually need all the data to be in RAM at one time? There are too few details to tell about later steps, but obviously not at the start, since you immediately want to throw away 75% of the lines you read.

So start off by doing that incrementally instead:

def getlines4(f):
    for i, line in enumerate(f):
        if i % 4 == 1:
            yield line

Even if you do nothing other than just that much, you can skip directly to the result of step 3, saving an enormous amount of peak RAM use:

with open(file, 'r') as f:
    data = list(getlines4(f))

Now peak RAM need is proportional to the number of bytes in the lines you care about, instead of to the total number of file bytes, period.

To continue making progress, instead of materializing all the lines of interest in data in one giant gulp, feed the lines (or chunks of lines) incrementally to your worker processes too. There wasn't enough detail for me to suggest concrete code for that, but keep the goal in mind and you'll figure it out: you only need enough RAM to keep incrementally feeding lines to worker processes, and to save away however much of the worker processes' results you need to keep in RAM. It's possible that peak memory use doesn't need to be more than "tiny", regardless of input file size.
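
As one possible shape for that (a minimal sketch, not code from the answer: process_line is a hypothetical stand-in for the real per-line work, and the pool size and chunksize are arbitrary):

import multiprocessing

def process_line(line):
    # hypothetical stand-in for the real per-line computation
    return len(line)

def getlines4(f):
    for i, line in enumerate(f):
        if i % 4 == 1:
            yield line.rstrip()

if __name__ == "__main__":
    pool = multiprocessing.Pool(20)
    with open("14982X16.fastq", "rb") as f:
        # imap() pulls lines from the generator lazily, so peak RAM is
        # bounded by the in-flight work, not by the size of the input file
        for result in pool.imap(process_line, getlines4(f), chunksize=10000):
            pass  # accumulate or write out results here
    pool.close()
    pool.join()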

Fighting memory management details instead is enormously harder than taking a memory-friendly approach to begin with. Python itself has several memory-management subsystems, and a great deal can be said about each of them. They in turn rely on the platform C malloc/free facilities, about which there's also a great deal to learn. And we're still not at a level that has anything directly to do with what your operating system reports for "memory use". The platform C libraries in turn rely on platform-specific OS memory management primitives, which - typically - only OS kernel memory experts truly understand.

The answer to "why does the OS say I'm still using N GiB of RAM?" can rely on application-specific details in any one of those layers, or even on unfortunate more-or-less accidental interactions among them. Far better to arrange not to need to ask such questions to begin with.

It's great that you gave some runnable code, but not so great that nobody but you can run it since nobody else has your data ;-) Things like "how many lines are there?" and "what's the distribution of line lengths?" can be critical, but we have no way to guess.

As I noted before, application-specific details are often necessary to out-think modern memory managers. They're complex, and behavior at all the levels can be subtle.

Python's primary object allocator ("obmalloc") requests "arenas" from the platform C malloc, chunks of 2**18 bytes. So long as that's the Python memory system your application is using (which can't be guessed at because we don't have your data to work with), 256 KiB is the smallest granularity at which memory is requested from, or returned to, the C level. The C level in turn typically has "chunk things up" strategies of its own, which vary across C implementations.
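
(If you want to watch obmalloc at work, CPython has a debugging hook for this: sys._debugmallocstats(), available in CPython 3.3+ and some 2.7 builds, dumps the allocator's arena/pool/block statistics to stderr.)

import sys

data = ["x" * 40 for _ in range(1000000)]   # lots of small objects
sys._debugmallocstats()                     # many arenas and pools in use
del data
sys._debugmallocstats()                     # pools freed, but some arenas may linger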

A Python arena is in turn carved into 4 KiB "pools", each of which dynamically adapts to be carved into smaller chunks of a fixed size per pool (8-byte chunks, 16-byte chunks, 24-byte chunks, ..., 8*i-byte chunks per pool).

So long as a single byte in an arena is being used for live data, the entire arena must be retained. If that means the other 262,143 arena bytes sit unused, tough luck. As your output shows, all the memory is returned in the end, so why do you really care? I understand it's an abstractly interesting puzzle, but you're not going to solve it short of making major efforts to understand the code in CPython's obmalloc.c. For a start. Any "summary" would leave out a detail that's actually important to some application's microscopic behavior.

Plausible: your strings are short enough that space for all the string object headers and contents (the actual string data) are obtained from CPython's obmalloc. They're going to be splattered all over multiple arenas. An arena might look like this, where "H" represents pools from which string object headers are allocated, and "D" pools from which space for string data is allocated:

HHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDD...

In your method1 they'll tend to alternate "like that" because creating a single string object requires allocating space separately for the string object header and the string object data. When you go on to throw out 3/4ths of the strings you created, more-or-less 3/4ths of that space becomes reusable to Python. But not one byte can be returned to the system C because there's still live data sprayed all over the arena, containing the quarter of the string objects you didn't throw away (here "-" means space available for reuse):

HHDD------------HHDD------------HHDD------------HHDD----...

There's so much free space that, in fact, it's possible that the less wasteful method2 can get all the memory it needs from the -------- holes left over from method1 even when you don't throw away the method1 result.
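
That pinning is easy to see directly; here's a small sketch (reusing psutil from the question's test script; exact numbers will vary by platform and Python version):

import os
import psutil

proc = psutil.Process(os.getpid())

def gb():
    return proc.memory_info().rss / 1000000000.0

data = ["x" * 40 for _ in range(10000000)]   # many small strings
print("all strings alive: %.2f GB" % gb())
data = data[::4]    # keep 1/4; survivors stay scattered across the old arenas
print("1/4 kept:          %.2f GB" % gb())   # typically far more than 1/4 of peak
del data
print("none kept:         %.2f GB" % gb())   # whole arenas can now be returned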

Just to keep things simple ;-) , I'll note that some of those details about how CPython's obmalloc gets used vary across Python releases too. In general, the more recent the Python release, the more it tries to use obmalloc first instead of the platform C malloc/free (because obmalloc is generally faster).

But even if you use the platform C malloc/free directly, you can still see the same kinds of things happening. Kernel memory system calls are typically more expensive than running code purely in user space, so platform C malloc/free routines typically have their own strategies for "ask the kernel for much more memory than we need for a single request, and carve it up into smaller pieces ourself".
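
(A very platform-specific aside, not something the answer above relies on: on Linux with glibc you can explicitly ask the C allocator to hand its free pages back to the kernel via malloc_trim. It cannot touch memory that obmalloc still holds.)

import ctypes

libc = ctypes.CDLL("libc.so.6")   # glibc only; this fails on macOS/Windows
libc.malloc_trim(0)               # return unused heap pages to the OS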

Something to note: neither Python's obmalloc nor platform C malloc/free implementations ever move live data on their own. Both return memory addresses to clients, and those cannot change. "Holes" are an inescapable fact of life under both.
