How do I find what is using memory in a Python process in a production system?


Problem description


My production system occasionally exhibits a memory leak I have not been able to reproduce in a development environment. I've used a Python memory profiler (specifically, Heapy) with some success in the development environment, but it can't help me with things I can't reproduce, and I'm reluctant to instrument our production system with Heapy because it takes a while to do its thing and its threaded remote interface does not work well in our server.

What I think I want is a way to dump a snapshot of the production Python process (or at least gc.get_objects), and then analyze it offline to see where it is using memory. How do I get a core dump of a python process like this? Once I have one, how do I do something useful with it?

Solution

I will expand on Brett's answer from my recent experience. The Dozer package is well maintained, and despite advancements like the addition of tracemalloc to the stdlib in Python 3.4, its gc.get_objects counting chart is my go-to tool for tackling memory leaks. Below I use dozer > 0.7, which had not been released at the time of writing (well, because I contributed a couple of fixes there recently).
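
As an aside, since the question asks about dumping a snapshot of a production process and analysing it offline, tracemalloc itself supports exactly that workflow on Python >= 3.4. Here is a minimal sketch; the file paths are placeholders, and the "offline" part assumes a second snapshot was dumped the same way:

import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation

# ...later, e.g. from a signal handler or an admin endpoint in production:
tracemalloc.take_snapshot().dump('/tmp/snapshot-1.bin')

# ...offline, after a second snapshot has been dumped the same way:
old = tracemalloc.Snapshot.load('/tmp/snapshot-1.bin')
new = tracemalloc.Snapshot.load('/tmp/snapshot-2.bin')
for stat in new.compare_to(old, 'lineno')[:10]:
    print(stat)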

Example

Let's look at a non-trivial memory leak. I'll use Celery 4.4 here and will eventually uncover the feature that causes the leak (and because it's a bug/feature kind of thing, it could be called mere misconfiguration caused by ignorance). So there's a Python 3.6 venv where I pip install "celery<4.5", and I have the following module.

demo.py

import time

import celery 


redis_dsn = 'redis://localhost'
app = celery.Celery('demo', broker=redis_dsn, backend=redis_dsn)

@app.task
def subtask():
    pass

@app.task
def task():
    for i in range(10_000):
        subtask.delay()
        time.sleep(0.01)


if __name__ == '__main__':
    task.delay().get()

Basically a task which schedules a bunch of subtasks. What can go wrong?

I'll use procpath to analyse Celery node memory consumption. pip install procpath[jsonpath]. I have 4 terminals:

  1. python -m procpath record -d celery.sqlite -i1 "$..children[?('celery' in @.cmdline)]" to record the Celery node's process tree stats
  2. docker run --rm -it -p 6379:6379 redis to run Redis which will serve as Celery broker and result backend
  3. celery -A demo worker --concurrency 2 to run the node with 2 workers
  4. python demo.py to finally run the example

(4) will finish in under 2 minutes.

Then I use Falcon SQL Client to visualise what procpath has recorded. I use this query:

SELECT datetime(ts, 'unixepoch', 'localtime') ts, stat_pid, stat_rss / 256.0 rss
FROM record 

And in Falcon I create a line chart trace with X=ts, Y=rss, and add split transform By=stat_pid. The result chart is:

This shape is likely pretty familiar to anyone who fought with memory leaks.
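
If Falcon isn't at hand, the same chart can be produced straight from the SQLite file. Here's a minimal sketch using sqlite3 and matplotlib; it assumes the record table columns used in the query above, and that stat_rss is counted in 4 KiB pages (hence the division by 256 to get MiB):

import sqlite3
from collections import defaultdict

import matplotlib.pyplot as plt

conn = sqlite3.connect('celery.sqlite')
rows = conn.execute('SELECT ts, stat_pid, stat_rss / 256.0 FROM record ORDER BY ts')

series = defaultdict(lambda: ([], []))
for ts, pid, rss in rows:
    series[pid][0].append(ts)
    series[pid][1].append(rss)

for pid, (xs, ys) in series.items():
    plt.plot(xs, ys, label=f'PID {pid}')  # one line per process in the tree

plt.xlabel('ts')
plt.ylabel('RSS, MiB')
plt.legend()
plt.savefig('rss.png')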

Finding leaking objects

Now it's time for Dozer. I'll show the non-instrumented case (you can instrument your code in a similar way if you're able to; see the middleware sketch after the injection snippet below). To inject the Dozer server into the target process I'll use Pyrasite. There are two things to know about it:

  • To run it, ptrace has to be configured as "classic ptrace permissions": echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope, which may be a security risk
  • There are non-zero chances that your target Python process will crash

With those caveats in mind, I:

  • pip install https://github.com/mgedmin/dozer/archive/3ca74bd8.zip (that's the to-be-released 0.8 I mentioned above)
  • pip install pillow (which dozer uses for charting)
  • pip install pyrasite

After that I can get a Python shell in the target process:

pyrasite-shell 26572

And inject the following, which will run Dozer's WSGI application using the stdlib's wsgiref server.

import threading
import wsgiref.simple_server

import dozer


def run_dozer():
    app = dozer.Dozer(app=None, path='/')
    with wsgiref.simple_server.make_server('', 8000, app) as httpd:
        print('Serving Dozer on port 8000...')
        httpd.serve_forever()

threading.Thread(target=run_dozer, daemon=True).start()
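
For the instrumented case mentioned above, an alternative to injection is to wrap your own WSGI application with Dozer as middleware at startup. A minimal sketch, where myproject.wsgi is a hypothetical stand-in for wherever your WSGI app lives:

import dozer

from myproject.wsgi import application  # hypothetical: your actual WSGI app module

# Dozer acts as WSGI middleware; its UI is then served by your app under /_dozer
application = dozer.Dozer(application, path='/_dozer')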

Opening http://localhost:8000 (the wsgiref server injected above) in a browser, you should see something like:

After that I run python demo.py from (4) again and wait for it to finish. Then in Dozer I set "Floor" to 5000, and here's what I see:

Two types related to Celery grow as the subtasks are scheduled:

  • celery.result.AsyncResult
  • vine.promises.promise

weakref.WeakMethod has the same shape and numbers and must be caused by the same thing.

Finding root cause

At this point, from the leaking types and the trends, it may already be clear what's going on in your case. If it's not, Dozer has a "TRACE" link per type, which allows tracing (e.g. seeing the object's attributes) the chosen object's referrers (gc.get_referrers) and referents (gc.get_referents), and continuing the process by traversing the graph further.
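
The same kind of inspection can also be done by hand from the Pyrasite shell with just the gc module. A rough sketch, assuming the AsyncResult type identified above:

import gc
from celery.result import AsyncResult

# collect live instances of the leaking type
leaked = [o for o in gc.get_objects() if isinstance(o, AsyncResult)]
print(len(leaked), 'AsyncResult instances are alive')

# look at what refers to one of them (the "leaked" list itself will also
# show up here, since it now holds a reference too)
print(gc.get_referrers(leaked[0])[:3])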

But a picture is worth a thousand words, right? So I'll show how to use objgraph to render the chosen object's dependency graph.

  • pip install objgraph
  • apt-get install graphviz

Then:

  • I run python demo.py from (4) again
  • in Dozer I set floor=0, filter=AsyncResult
  • and click "TRACE" which should yield

Then in the Pyrasite shell run:

import objgraph
objgraph.show_backrefs([objgraph.at(140254427663376)], filename='backref.png')

The PNG file should contain:

Basically there's some Context object containing a list called _children, which in turn contains many instances of celery.result.AsyncResult, and those leak. Changing the filter to Filter=celery.*context in Dozer, here's what I see:

So the culprit is celery.app.task.Context. Searching for that type would certainly lead you to the Celery task page. A quick search for "children" there turns up the following:

trail = True

If enabled the request will keep track of subtasks started by this task, and this information will be sent with the result (result.children).

The fix is to disable the trail by setting trail=False, like:

@app.task(trail=False)
def task():
    for i in range(10_000):
        subtask.delay()
        time.sleep(0.01)

Then restarting the Celery node from (3) and running python demo.py from (4) yet again shows this memory consumption:

Problem solved!
