How do I find what is using memory in a Python process in a production system?
Problem description
My production system occasionally exhibits a memory leak that I have not been able to reproduce in a development environment. I've used a Python memory profiler (specifically, Heapy) with some success in the development environment, but it can't help me with things I can't reproduce, and I'm reluctant to instrument our production system with Heapy because it takes a while to do its thing and its threaded remote interface does not work well on our server.
What I think I want is a way to dump a snapshot of the production Python process (or at least gc.get_objects), and then analyze it offline to see where it is using memory. How do I get a core dump of a python process like this? Once I have one, how do I do something useful with it?
I will expand on Brett's answer based on my recent experience. The Dozer package is well maintained, and despite advancements like the addition of tracemalloc to the stdlib in Python 3.4, its gc.get_objects counting chart is my go-to tool for tackling memory leaks. Below I use dozer > 0.7, which had not been released at the time of writing (because I contributed a couple of fixes to it recently).
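For comparison, the tracemalloc route mentioned above boils down to diffing two snapshots; a minimal sketch:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

leak = [bytearray(1024) for _ in range(1000)]  # simulate a leak

after = tracemalloc.take_snapshot()
# Statistics are grouped per source line, sorted by biggest growth first
top = after.compare_to(before, 'lineno')
for stat in top[:3]:
    print(stat)
```

Unlike Dozer's type-count chart, this attributes growth to allocation sites (file and line), which is complementary information when hunting a leak.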
Example
Let's look at a non-trivial memory leak. I'll use Celery 4.4 here and will eventually uncover the feature that causes the leak (and because it's a bug/feature kind of thing, it can be called mere misconfiguration caused by ignorance). So there's a Python 3.6 venv where I pip install celery < 4.5, and I have the following module.
demo.py
import time
import celery
redis_dsn = 'redis://localhost'
app = celery.Celery('demo', broker=redis_dsn, backend=redis_dsn)
@app.task
def subtask():
pass
@app.task
def task():
for i in range(10_000):
subtask.delay()
time.sleep(0.01)
if __name__ == '__main__':
task.delay().get()
Basically a task which schedules a bunch of subtasks. What can go wrong?
I'll use procpath to analyse the Celery node's memory consumption (pip install procpath[jsonpath]). I have 4 terminals:
1. python -m procpath record -d celery.sqlite -i1 "$..children[?('celery' in @.cmdline)]" to record the Celery node's process tree stats
2. docker run --rm -it -p 6379:6379 redis to run Redis, which will serve as the Celery broker and result backend
3. celery -A demo worker --concurrency 2 to run the node with 2 workers
4. python demo.py to finally run the example
(4) will finish in under 2 minutes.
Then I use Falcon SQL Client to visualise what procpath has recorded. I use this query:
SELECT datetime(ts, 'unixepoch', 'localtime') ts, stat_pid, stat_rss / 256.0 rss
FROM record
And in Falcon I create a line chart trace with X=ts, Y=rss, and add a split transform By=stat_pid. The resulting chart is:

This shape is likely familiar to anyone who has fought memory leaks.
Finding leaking objects
Now it's time for dozer. I'll show the non-instrumented case (and you can instrument your code in a similar way if you can). To inject the Dozer server into the target process I'll use Pyrasite. There are two things to know about it:
- To run it, ptrace has to be configured with "classic ptrace permissions": echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope, which may be a security risk
- There is a non-zero chance that your target Python process will crash
With those caveats in mind, I:
- pip install https://github.com/mgedmin/dozer/archive/3ca74bd8.zip (that's the to-be 0.8 I mentioned above)
- pip install pillow (which dozer uses for charting)
- pip install pyrasite
After that I can get a Python shell in the target process:
pyrasite-shell 26572
And inject the following, which will run Dozer's WSGI application using the stdlib's wsgiref server.
import threading
import wsgiref.simple_server
import dozer
def run_dozer():
app = dozer.Dozer(app=None, path='/')
with wsgiref.simple_server.make_server('', 8000, app) as httpd:
print('Serving Dozer on port 8000...')
httpd.serve_forever()
threading.Thread(target=run_dozer, daemon=True).start()
Opening http://localhost:8000 in a browser, you should see something like:
After that I run python demo.py from (4) again and wait for it to finish. Then in Dozer I set "Floor" to 5000, and here's what I see:
Two types related to Celery grow as the subtasks are scheduled:

- celery.result.AsyncResult
- vine.promises.promise

weakref.WeakMethod has the same shape and numbers and must be caused by the same thing.
Finding root cause
At this point, from the leaking types and the trends, it may already be clear what's going on in your case. If it's not, Dozer has a "TRACE" link per type, which allows tracing (e.g. seeing an object's attributes) the chosen object's referrers (gc.get_referrers) and referents (gc.get_referents), and then continuing to traverse the graph.
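The same two primitives can also be exercised by hand in the Pyrasite shell. A minimal sketch, where the suspect and holder objects are hypothetical stand-ins for a real leaked object and its container:

```python
import gc

suspect = ['leaked item']        # hypothetical stand-in for a leaked object
holder = {'_children': suspect}  # simulates a container keeping it alive

# One step up the reference graph: who points at the suspect?
referrers = gc.get_referrers(suspect)
print(any(r is holder for r in referrers))  # the holder is among them

# One step down: what does the holder point at?
print(suspect in gc.get_referents(holder))
```

Repeating these two calls on the objects they return is exactly the graph traversal Dozer's "TRACE" link automates.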
But a picture is worth a thousand words, right? So I'll show how to use objgraph to render the chosen object's dependency graph.
- pip install objgraph
- apt-get install graphviz
Then:
- I run python demo.py from (4) again
- in Dozer I set floor=0, filter=AsyncResult
- and click "TRACE", which should yield
Then in the Pyrasite shell run:
objgraph.show_backrefs([objgraph.at(140254427663376)], filename='backref.png')
The PNG file should contain:
Basically there's some Context object containing a list called _children, which in turn contains many instances of celery.result.AsyncResult, and these leak. Changing Filter=celery.*context in Dozer, here's what I see:
So the culprit is celery.app.task.Context. Searching for that type would certainly lead you to the Celery task page. Quickly searching for "children" there, here's what it says:
trail = True
If enabled, the request will keep track of subtasks started by this task, and this information will be sent with the result (result.children).
Disable the trail by setting trail=False, like:
@app.task(trail=False)
def task():
for i in range(10_000):
subtask.delay()
time.sleep(0.01)
Then restarting the Celery node from (3) and running python demo.py from (4) yet again shows this memory consumption:
Problem solved!