Force Python to release objects to free up memory


Problem Description

I am running the following code:

import glob
from myUtilities import myObject

for year in range(2006, 2015):
    front = 'D:\\newFilings\\'
    back = '\\*\\dirTYPE\\*.sgml'
    path = front + str(year) + back
    sgmlFilings = glob.glob(path)
    for each in sgmlFilings:
        header = myObject(each)
        try:
            tagged = header.process_tagged('G:')
        except Exception as e:
            # log the problem file and keep going
            outref = open('D:\\ProblemFiles.txt', 'a')
            outref.write(each + '\n')
            outref.close()
            print each

If I start from a reboot, the memory allocation/consumption by Python is fairly small. Over time, though, it increases significantly, and ultimately after about a day I have very little free memory (24 GB installed [294 MB free, 23,960 MB cached]), and the memory claimed by Python in the Windows Task Manager is 3 GB. I am watching this increase over the three days it takes to run the code against the file collection.

I was under the impression that, since I am doing everything with

tagged = header.process_tagged('G:')

the memory associated with each loop would be freed and garbage collected.

Is there something I can do to force the release of this memory? While I have not run statistics yet, I can tell by watching activity on the disk that the process slows down as time goes on (and the memory ~lump~ gets bigger).

EDIT

I looked at the question referenced below, and I do not think it is the same issue: as I understand it, in the other question they are holding onto the objects (a list of triangles) and need the entire list for computation. With each loop I am reading a file, performing some processing of the file, and then writing it back out to disk. Then I am reading the next file...

With regard to possible memory leaks, I am using LXML in myObject.

Note: I added the line from myUtilities import myObject since the first iteration of this question. myUtilities holds the code that does everything.

Regarding posting my code for myUtilities - that gets away from the basic question. I am done with header and tagged after each iteration; tagged does stuff and writes the results to another drive, as a matter of fact a newly formatted drive.

I looked into using multiprocessing but decided against it because of a vague idea that, since this is so I/O intensive, I would be competing for the drive heads - maybe that is wrong, but since each iteration requires that I write a couple of hundred MB of files, I would think I would be competing for write or even read time.

UPDATE - I had one case in the myObject class where a file was opened with

myString = open(somefile).read()

I changed it to

with open(somefile, 'r') as fHandle:
    myString = fHandle.read()

(sorry for the formatting - still struggling)

However, this had no apparent effect. When I started a new cycle I had 4,000 MB of cached memory; after 22 minutes and the processing of 27K files I had roughly 26,000 MB of cached memory.

I appreciate all of the answers and comments below and have been reading up and testing various things all day. I will keep updating this: I thought this task would take a week, and now it looks like it might take over a month.

I keep getting questions about the rest of the code. However, it is over 800 lines, and to me that sort of gets away from the central question.

So an instance of myObject is created, and then we apply the methods contained in myObject to header.

This is basically file transformation: a file is read in, and copies of parts of the file are made and written to disk.

The central question to me is that there is obviously some persistence with either header or tagged. How can I dispose of everything related to header or tagged before I start the next cycle?
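
The obvious thing to try is deleting the references explicitly and forcing a collection pass. A minimal sketch of that idea, reusing the loop from above (as noted further down, gc.collect on its own made no difference for me):

import gc

for each in sgmlFilings:
    header = myObject(each)
    tagged = header.process_tagged('G:')
    # drop the only references to this iteration's objects ...
    del header, tagged
    # ... and force a collection pass; CPython may still keep the freed memory for reuse
    gc.collect()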

I have been running the code for the last 14 hours or so. When it went through the first cycle, it took about 22 minutes to process 27K files; now it is taking an hour and a half to handle approximately the same number.

Just running gc.collect does not work. I stopped the program and tried it in the interpreter, and I saw no movement in the memory statistics.

EDIT: After reading the memory allocator description linked below, I am thinking that the amount tied up in the cache is not the problem - it is the amount tied up by the running Python process. So the new test is running the code from the command line. I am continuing to watch and monitor, and I will post more once I see what happens.

Still struggling, but I have set up the code to run from a .bat file with the data from one loop of sgmlFilings (see above). The batch file looks like this:

python batch.py
python batch.py
 .
 .
 .

batch.py starts by reading a queue file that holds a list of directories to glob; it takes the first one off the list, updates the list and saves it, and then runs the header and tagged processes. Clumsy, but since python.exe is closed after each iteration, Python never accumulates memory, and so the process runs at a consistent speed.
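
For reference, a rough sketch of what batch.py does. The queue file name (D:\dirQueue.txt) is just a placeholder, assumed to hold one glob pattern per line; the error handling mirrors the loop at the top of the question:

import glob
from myUtilities import myObject

QUEUE_FILE = 'D:\\dirQueue.txt'   # placeholder name: one glob pattern per line

# take the first pattern off the queue and save the shortened queue
with open(QUEUE_FILE, 'r') as fh:
    patterns = [line.strip() for line in fh if line.strip()]

if patterns:
    current, remaining = patterns[0], patterns[1:]
    with open(QUEUE_FILE, 'w') as fh:
        fh.write('\n'.join(remaining))

    # process one batch of filings, logging problem files as before
    for each in glob.glob(current):
        header = myObject(each)
        try:
            tagged = header.process_tagged('G:')
        except Exception:
            with open('D:\\ProblemFiles.txt', 'a') as outref:
                outref.write(each + '\n')

# when this script exits, python.exe exits with it, so the OS gets all of its
# memory back before the .bat file launches the next run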

Answer

The reason is CPython's memory management. The way Python manages memory makes it hard for long-running programs. When you explicitly free an object with the del statement, CPython does not necessarily return the allocated memory to the OS; it keeps the memory for further use in the future. One way to work around this problem is to use the multiprocessing module: kill the process after you are done with the job and create another one. This way you free the memory by force, and the OS must free up the memory used by that child process. I have had exactly the same problem: memory usage increased excessively over time, to the point where the system became unstable and unresponsive. I used a different technique with signals and psutil to work around it. This problem commonly occurs when you have a loop and need to allocate and deallocate data repeatedly, for example.
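
A minimal sketch of that multiprocessing workaround, adapted to the loop in the question - process_batch is a hypothetical wrapper name, and the rest reuses the question's own paths and names:

import glob
import multiprocessing

def process_batch(paths):
    # runs inside the child process, so everything it allocates dies with it
    from myUtilities import myObject
    for each in paths:
        try:
            myObject(each).process_tagged('G:')
        except Exception:
            with open('D:\\ProblemFiles.txt', 'a') as outref:
                outref.write(each + '\n')

if __name__ == '__main__':
    for year in range(2006, 2015):
        paths = glob.glob('D:\\newFilings\\' + str(year) + '\\*\\dirTYPE\\*.sgml')
        p = multiprocessing.Process(target=process_batch, args=(paths,))
        p.start()
        p.join()   # when the child exits, the OS reclaims its memory

Chunking the work per year is arbitrary here; smaller batches hand memory back to the OS more often.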

You can read more about the Python memory allocator here: http://www.evanjones.ca/memoryallocator/

This tool is also very helpful for profiling memory usage: https://pypi.python.org/pypi/memory_profiler
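
For example, decorating a suspect function with memory_profiler's profile decorator prints a line-by-line memory report when the script runs; process_one below is a hypothetical wrapper around the question's myObject call:

from memory_profiler import profile

@profile
def process_one(path):
    # whatever you suspect of holding memory goes in here
    from myUtilities import myObject
    header = myObject(path)
    return header.process_tagged('G:')

Running the script normally (or via python -m memory_profiler yourscript.py) then shows how much memory each line adds or releases.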

One more thing: add __slots__ to myObject. It seems your objects have a fixed set of attributes, and declaring __slots__ also helps reduce RAM usage. Objects without __slots__ specified allocate more RAM to take care of the dynamic attributes you may add to them later: http://tech.oyster.com/save-ram-with-python-slots/
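
A toy illustration of the idea; the attribute names are made up, since the real myObject code is not posted:

class HeaderRecord(object):
    # fixed attribute set: no per-instance __dict__, so each instance is smaller
    __slots__ = ('path', 'year', 'tagged')

    def __init__(self, path, year):
        self.path = path
        self.year = year
        self.tagged = None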
