Why is my Python app stalled with 'system' / kernel CPU time


Question

First off, I wasn't sure whether to post this as an Ubuntu question or here, but I'm guessing it's more of a Python question than an OS one.

My Python application runs on Ubuntu on a 64-core AMD server. It pulls images from 5 GigE cameras over the network by calling out to a .so through ctypes and then processes them. I am seeing frequent pauses in my application, which cause frames from the cameras to be dropped by the external camera library.

To debug this I've used the popular psutil Python package, with which I log CPU stats every 0.2 seconds in a separate thread. I sleep for 0.2 seconds in that thread, and whenever that sleep takes substantially longer I also see camera frames being dropped. I have seen pauses up to 17 seconds long! Most of my processing is either in OpenCV or Numpy (both of which release the GIL) or, in one part of the app, in a multiprocessing.Pool with 59 processes (this is to get around the Python GIL).
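The logging thread described above can be sketched roughly as follows. This is a minimal illustration and not the original code: the name `monitor`, the `tolerance` threshold, and the use of the standard library's `os.times()` in place of psutil's counters are all assumptions, and it targets Python 3 (`time.monotonic` does not exist in Python 2.7).

```python
import os
import threading
import time

def monitor(interval=0.2, tolerance=0.1, iterations=5, log=print):
    """Sleep `interval` seconds per loop and record how late each wake-up was."""
    overruns = []
    prev = os.times()  # stdlib stand-in for psutil's CPU counters (assumption)
    for _ in range(iterations):
        start = time.monotonic()
        time.sleep(interval)
        elapsed = time.monotonic() - start
        now = os.times()
        # CPU seconds used by this process per wall-clock second since the last sample
        user = (now[0] - prev[0]) / elapsed
        system = (now[1] - prev[1]) / elapsed
        prev = now
        late = elapsed - interval
        if late > tolerance:
            # The sleep overran: the whole process was probably stalled.
            log("stall: slept %.2fs, expected %.2fs" % (elapsed, interval))
        log("user=%.2f system=%.2f" % (user, system))
        overruns.append(late)
    return overruns

if __name__ == "__main__":
    t = threading.Thread(target=monitor, kwargs={"iterations": 3})
    t.start()
    t.join()
```

A sleep that returns much later than requested is a cheap stall detector, because it needs no cooperation from the threads doing the real work.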

My debug logging shows very high 'system' (i.e. kernel) CPU time on many of my process's threads when the pauses happen.

For example, I see CPU times as follows (usually every 0.2 seconds) and then suddenly a big jump ('Process' numbers are in CPU utilization, i.e. 1 CPU fully used would be 1, and Linux top showing 123% would be 1.2):

Process user | Process system | OS system % | OS idle %
19.9         | 10.5           | 6           | 74 
5.6          | 2.3            | 4           | 87
6.8          | 1.7            | 11          | 75
4.6          | 5.5            | 43          | 52
0.5          | 26.4           | 4           | 90

I don't know why the high OS system usage is reported one line before the matching high process system usage. The two match up, since 26.4 of 64 cores = 41%. At that point my application experienced a pause of approximately 3.5 seconds (as determined by my CPU info logging thread using OpenCV's cv2.getTickCount() and also by the jump in time stamps in the Python logging output), causing multiple camera frames to be dropped.
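The conversion between the two columns is a single division. The helper name `machine_share` below is made up for illustration:

```python
def machine_share(process_cpus, total_cores):
    """Convert 'CPUs fully in use' into a percentage of the whole machine."""
    return 100.0 * process_cpus / total_cores

# 26.4 CPUs' worth of kernel time on a 64-core box
print(round(machine_share(26.4, 64)))  # 41
```

That is close to the 43% 'OS system %' logged one row earlier in the table.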

When this happens I have also logged the CPU info for each thread of my process. For the example above, 25 threads were running at a 'system' CPU utilization of 0.9 and a few more at 0.6, which matches the process total of 26.4 above. At that point there were about 183 threads running.
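One way to get such per-thread figures is to take two snapshots of cumulative per-thread CPU times and divide the difference by the elapsed wall-clock time. The sketch below works on plain dictionaries so that it stays self-contained. In practice the snapshots could be built from `psutil.Process().threads()`, whose entries carry `id`, `user_time` and `system_time`. The function name and the sample numbers are hypothetical.

```python
def thread_system_utilization(before, after, elapsed):
    """Per-thread kernel CPU utilization between two snapshots.

    `before` and `after` map thread id -> (user_seconds, system_seconds).
    Threads missing from `before` are treated as having started at zero.
    """
    usage = {}
    for tid, (_, sys_after) in after.items():
        _, sys_before = before.get(tid, (0.0, 0.0))
        usage[tid] = (sys_after - sys_before) / elapsed
    return usage

# Hypothetical snapshots taken 0.2 s apart
before = {101: (1.00, 0.50), 102: (2.00, 0.10)}
after = {101: (1.01, 0.68), 102: (2.15, 0.11)}
usage = thread_system_utilization(before, after, elapsed=0.2)

# Threads spending most of their time in the kernel
busy = sorted(tid for tid, u in usage.items() if u > 0.5)
print(busy)
```

Summing the values in `usage` gives the process-wide 'system' figure, which is how the per-thread numbers can be checked against the process total.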

The pause usually seems to happen shortly after the multiprocessing pool is used (it's used for short bursts), but it by no means happens every time the pool is used. Also, if I halve the amount of processing that needs to happen outside the pool, no camera frames are skipped.

Question: how can I determine why OS 'system' / kernel time suddenly goes through the roof? Why would that happen in a Python app?

And more importantly: any ideas why this is happening and how to avoid it?

Notes:

  • This runs as root (it has to for the camera library, unfortunately) from upstart
  • When the cameras are turned off the app restarts (using respawn in upstart). This happens multiple times a day, so the problem is not due to the process being long-running; I have also seen it happen very soon after the process starts
  • The same code runs over and over, so it is not due to a different branch of my code being executed
  • It currently has a nice of -2; I have tried removing the nice with no effect
  • Ubuntu 12.04.5 LTS
  • Python 2.7
  • The machine has 128GB of memory, which I am nowhere near using

Answer

OK. I have the answer to my own question. Yes, it's taken me over 3 months to get this far.

It appears that GIL thrashing in Python is the reason for the massive 'system' CPU spikes and the associated pauses. Here is a good explanation of where the thrashing comes from. That presentation also pointed me in the right direction.

Python 3.2 introduced a new GIL implementation to avoid this thrashing. The difference can be shown with a simple threaded example (taken from the presentation above):

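from threading import Thread
import psutil

def countdown():
    # CPU-bound pure-Python loop, so it needs the GIL to make progress.
    n = 100000000
    while n > 0:
        n -= 1

# Two CPU-bound threads competing for the one GIL.
t1 = Thread(target=countdown)
t2 = Thread(target=countdown)
t1.start(); t2.start()
t1.join(); t2.join()

# Cumulative user and system CPU time for this process.
print(psutil.Process().cpu_times())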

On my Macbook Pro with Python 2.7.9 this uses 14.7s of 'user' CPU and 13.2s of 'system' CPU.

Python 3.4 uses 15.0s of 'user' (slightly more) but only 0.2s of 'system'.

So the GIL is still in place, and the code still runs only as fast as it would single-threaded, but Python 3 avoids the GIL contention of Python 2 that manifests as kernel ('system') CPU time. This contention, I believe, is what was causing the issues in the original question.
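To check the single-threaded comparison on a given interpreter, the same countdown can be timed both sequentially and in two threads. The sketch below is a variation on the example above and not part of the original measurements: the helper names are made up, the loop count is reduced so it finishes quickly, and it reads CPU times from the standard library's `os.times()` instead of psutil so that the same script runs under both Python 2 and Python 3.

```python
import os
from threading import Thread

def countdown(n):
    while n > 0:
        n -= 1

def cpu_delta(work):
    """Run `work()` and return the (user, system) CPU seconds it cost this process."""
    start = os.times()
    work()
    end = os.times()
    # Index 0 is user time and index 1 is system time on Python 2 and 3 alike.
    return end[0] - start[0], end[1] - start[1]

def sequential(n):
    countdown(n)
    countdown(n)

def threaded(n):
    threads = [Thread(target=countdown, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    n = 5000000  # far fewer iterations than the original 100000000
    for name, work in (("sequential", sequential), ("threaded", threaded)):
        user, system = cpu_delta(lambda: work(n))
        print("%-10s user=%.2fs system=%.2fs" % (name, user, system))
```

If the threaded run shows a 'system' figure far above the sequential one, that interpreter is likely suffering from the GIL contention described here.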

An additional cause of the CPU problem was found to be OpenCV/TBB. It is fully documented in this SO question.
