Python / OpenCV application lockup issue

Problem description

My Python application running on a 64-core Linux box normally runs without a problem. Then after some random length of time (around 0.5 to 1.5 days usually) I suddenly start getting frequent pauses/lockups of over 10 seconds! During these lockups the system CPU time (i.e. time in the kernel) can be over 90% (yes: 90% of all 64 cores, not of just one CPU).

My app is restarted often throughout the day. Restarting the app does not fix the problem. However, rebooting the machine does.

Question 1: What could cause 90% system CPU time for 10 seconds? All of the system CPU time is in my parent Python process, not in the child processes created through Python's multiprocessing or other processes. So that means something of the order of 60+ threads spending 10+ seconds in the kernel. I am not even sure if this is a Python issue or a Linux kernel issue.

Question 2: That a reboot fixes the problem must be a big clue as to the cause. What Linux resource could be left exhausted on the system across my app's restarts, but cleared by a reboot, that could cause this problem to persist?

Below I will mention multiprocessing a lot. That's because the application runs in a cycle and multiprocessing is only used in one part of the cycle. The high CPU almost always happens immediately after all the multiprocessing calls finish. I'm not sure if this is a hint at the cause or a red herring.

  • My app runs a thread that uses psutil to log out the process and system CPU stats every 0.5 seconds. I have independently confirmed what it's reporting with top.
  • I've converted my app from Python 2.7 to Python 3.4 because Python 3.2 got a new GIL implementation and 3.4 had the multiprocessing rewritten. While this improved things it did not solve the problem (see my previous SO question which I'm leaving because it's still a useful answer, if not the total answer).
  • I have replaced the OS. Originally it was Ubuntu 12 LTS, now it's CentOS 7. No difference.
  • It turns out multithreading and multiprocessing clash in Python/Linux and are not recommended together; Python 3.4 now has forkserver and spawn multiprocessing contexts to address this. I've tried them, no difference.
  • I've checked /dev/shm to see if I'm running out of shared memory (which Python 3.4 uses to manage multiprocessing), and found nothing unusual
  • lsof output listing all resources here
  • It's difficult to test on other machines because I run a multiprocess Pool of 59 children and I don't have any other 64 core machines just lying around
  • I can't run it using threads rather than processes because it just can't run fast enough due to the GIL (hence why I switched to multiprocessing in the first place)
  • I've tried using strace on just one thread that is running slow (it can't run across all threads because it slows the app down far too much). Below is what I got, which doesn't tell me much.
  • ltrace does not work because you can't use -p on a thread ID. Even just running it on the main thread (no -f) makes the app so slow that the problem doesn't show up.
  • The problem is not related to load. It will sometimes run fine at full load, and then later at half load, it'll suddenly get this problem.
  • Even if I reboot the machine nightly the problem comes back every couple of days.
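The monitoring thread from the first bullet can be approximated with the stdlib alone (the original uses psutil and also logs system-wide stats; `os.times()` here is a simplified stand-in that covers only this process):

```python
import os
import threading
import time

def log_cpu(interval, samples, out):
    """Sample this process's user+system CPU consumption every `interval` s."""
    prev = os.times()
    for _ in range(samples):
        time.sleep(interval)
        cur = os.times()
        # CPU seconds consumed since the last sample, divided by the wall
        # interval: utilisation, which can exceed 1.0 with many threads.
        out.append(((cur.user - prev.user) + (cur.system - prev.system)) / interval)
        prev = cur

stats = []
t = threading.Thread(target=log_cpu, args=(0.1, 3, stats), daemon=True)
t.start()
t.join()
print(stats)
```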

Environment / notes:

  • Python 3.4.3 compiled from source
  • CentOS 7 totally up to date. uname -a: Linux 3.10.0-229.4.2.el7.x86_64 #1 SMP Wed May 13 10:06:09 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux (although this kernel update was only applied today)
  • Machine has 128GB of memory and has plenty free
  • I use numpy linked to ATLAS. I'm aware that OpenBLAS clashes with Python multiprocessing but ATLAS does not, and that clash is solved by Python 3.4's forkserver and spawn which I've tried.
  • I use OpenCV which also does a lot of parallel work
  • I use ctypes to access a C .so library provided by a camera manufacturer
  • App runs as root (a requirement of a C library I link to)
  • The Python multiprocessing Pool is created in code guarded by if __name__ == "__main__": and in the main thread
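The ctypes access pattern mentioned above looks roughly like this (libc stands in for the proprietary camera library, whose name and API are not given in the question):

```python
import ctypes
import ctypes.util

# Locate and load a shared library. The camera vendor's .so would be loaded
# the same way, e.g. ctypes.CDLL("/path/to/libcamera_vendor.so").
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the prototype so ctypes converts arguments and results correctly.
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

result = libc.abs(-42)
print(result)
```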

A few times I've managed to strace a thread that ran at 100% 'system' CPU. But only once have I gotten anything meaningful out of it. See below the call at 10:24:12.446614 that takes 1.4 seconds. Given it's the same ID (0x7f05e4d1072c) you see in most of the other calls, my guess would be that this is Python's GIL synchronisation. Does this guess make sense? If so, then the question is why does the wait take 1.4 seconds? Is someone not releasing the GIL?

10:24:12.375456 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.000823>
10:24:12.377076 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.002419>
10:24:12.379588 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.001898>
10:24:12.382324 sched_yield()           = 0 <0.000186>
10:24:12.382596 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.004023>
10:24:12.387029 sched_yield()           = 0 <0.000175>
10:24:12.387279 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.054431>
10:24:12.442018 sched_yield()           = 0 <0.000050>
10:24:12.442157 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.003902>
10:24:12.446168 futex(0x7f05e4d1022c, FUTEX_WAKE, 1) = 1 <0.000052>
10:24:12.446316 futex(0x7f05e4d11cac, FUTEX_WAKE, 1) = 1 <0.000056>
10:24:12.446614 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <1.439739>
10:24:13.886513 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.002381>
10:24:13.889079 sched_yield()           = 0 <0.000016>
10:24:13.889135 sched_yield()           = 0 <0.000049>
10:24:13.889244 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.032761>
10:24:13.922147 sched_yield()           = 0 <0.000020>
10:24:13.922285 sched_yield()           = 0 <0.000104>
10:24:13.923628 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.002320>
10:24:13.926090 sched_yield()           = 0 <0.000018>
10:24:13.926244 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.000265>
10:24:13.926667 sched_yield()           = 0 <0.000027>
10:24:13.926775 sched_yield()           = 0 <0.000042>
10:24:13.926964 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000117>
10:24:13.927241 futex(0x7f05e4d110ac, FUTEX_WAKE, 1) = 1 <0.000099>
10:24:13.927455 futex(0x7f05e4d11d2c, FUTEX_WAKE, 1) = 1 <0.000186>
10:24:13.931318 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.000678>
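The outliers in a trace like this can be pulled out mechanically. A small parser over `strace -tt -T` output (assuming the `<seconds>` duration suffix shown above) makes the 1.4 s wait obvious:

```python
import re

# Matches strace -tt -T lines: "HH:MM:SS.usec syscall(...) = ret <duration>"
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d+)\s+(\w+)\(.*<([\d.]+)>$")

def slow_calls(trace, threshold=1.0):
    """Return (timestamp, syscall, seconds) for calls slower than threshold."""
    hits = []
    for line in trace.splitlines():
        m = LINE.match(line.strip())
        if m and float(m.group(3)) > threshold:
            hits.append((m.group(1), m.group(2), float(m.group(3))))
    return hits

sample = """\
10:24:12.446316 futex(0x7f05e4d11cac, FUTEX_WAKE, 1) = 1 <0.000056>
10:24:12.446614 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <1.439739>
10:24:13.886513 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.002381>"""

print(slow_calls(sample))
```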

Answer

I've managed to get a thread dump from gdb right at the point where 40+ threads are showing 100% 'system' CPU time.

Here's the backtrace, which is the same for every one of those threads:

#0  0x00007fffebe9b407 in cv::ThresholdRunner::operator()(cv::Range const&) const () from /usr/local/lib/libopencv_imgproc.so.3.0
#1  0x00007fffecfe44a0 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, (anonymous namespace)::ProxyLoopBody, tbb::auto_partitioner const>::execute() () from /usr/local/lib/libopencv_core.so.3.0
#2  0x00007fffe967496a in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
#3  0x00007fffe96705a6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
#4  0x00007fffe966fc6b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
#5  0x00007fffe966d65f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
#6  0x00007fffe966d859 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
#7  0x00007ffff76e9df5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007ffff6d0e1ad in clone () from /lib64/libc.so.6
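For reference, a dump like the one above can be collected by attaching gdb to the live process (`<PID>` is a placeholder for the parent process's PID):

```shell
# Attach to the running process, dump every thread's backtrace,
# then detach without killing it.
gdb -p <PID> -batch -ex "set pagination off" -ex "thread apply all bt"
```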

My original question put Python and Linux front and center, but the issue appears to lie with TBB and/or OpenCV. Since OpenCV with TBB is so widely used, I presume it must also involve the interplay with my specific environment somehow. Maybe because it's a 64-core machine?

I have recompiled OpenCV with TBB turned off and the problem has not reappeared so far. But my app now runs slower.
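The rebuild was along these lines (the exact flag set is an assumption; `WITH_TBB` is the relevant OpenCV CMake option):

```shell
# Configure OpenCV without TBB; other parallel backends (e.g. pthreads)
# may still be picked up unless similarly disabled.
cmake -D WITH_TBB=OFF -D CMAKE_BUILD_TYPE=Release ..
make -j"$(nproc)" && sudo make install
```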

I have posted this as a bug to OpenCV and will update this answer with anything that comes from that.
