Celery worker hangs without any error
Problem description
I have a production setup running Celery workers that make POST/GET requests to a remote service and store the results. It handles a load of around 20k tasks per 15 minutes.
The problem is that the workers go numb for no apparent reason: no errors, no warnings.
I have also tried adding multiprocessing, with the same result.
In the log I see the task execution time increasing, like succeeded in s
For more details, see https://github.com/celery/celery/issues/2621
Recommended answer
If your Celery worker sometimes gets stuck, you can use strace and lsof to find out which system call it is stuck in.
For example:
$ strace -p 10268 -s 10000
Process 10268 attached - interrupt to quit
recvfrom(5,
Here 10268 is the PID of the Celery worker, and the unfinished recvfrom(5, call means the worker is blocked waiting to receive data on file descriptor 5.
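To illustrate why a receive call can stall like this, here is a minimal sketch using only the Python standard library. It is not the worker's actual code: the function name recv_with_timeout and the local "silent" listener are assumptions made for the demonstration. Without a timeout, recv() on a connection whose peer never sends anything blocks indefinitely, which is what an unfinished recvfrom( in strace looks like. With a timeout, the same call gives up and raises an exception.

```python
import socket


def recv_with_timeout(host, port, timeout_s):
    """Connect, then wait for data; return None if the peer stays silent."""
    # create_connection() applies timeout_s to the socket, so the later
    # recv() call is bounded too. With timeout=None it could block forever.
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        try:
            return conn.recv(1024)
        except socket.timeout:
            return None


# Demo: a local listener that completes the TCP handshake but never sends data.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
host, port = listener.getsockname()

result = recv_with_timeout(host, port, 0.5)
print("received:", result)
listener.close()
```

The connection itself succeeds, so nothing looks wrong from the outside. Only the bounded wait turns the silent stall into something the program can detect.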
Then you can use lsof to check what file descriptor 5 is in this worker process.
$ lsof -p 10268
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
......
celery 10268 root 5u IPv4 828871825 0t0 TCP 172.16.201.40:36162->10.13.244.205:wap-wsp (ESTABLISHED)
......
This indicates that the worker is stuck on a TCP connection (see 5u in the FD column).
Some Python packages, such as requests, block while waiting for data from the peer, and requests applies no timeout unless you pass one. This can cause the Celery worker to hang, so if you are using requests, make sure to set the timeout argument.
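As a rough sketch of that advice, the helper below passes an explicit timeout to requests.get. The function name fetch, the DEFAULT_TIMEOUT values, and the choice to return None on a timeout are illustrative assumptions, not part of the original answer. The timeout tuple is requests' (connect timeout, read timeout) form, in seconds.

```python
import requests

# (connect timeout, read timeout) in seconds; illustrative values only.
DEFAULT_TIMEOUT = (3.05, 10)


def fetch(url, timeout=DEFAULT_TIMEOUT):
    """GET url and return the body, or None if the request timed out."""
    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Raised for both connect and read timeouts, so the caller gets
        # control back instead of blocking on a silent peer.
        return None
    response.raise_for_status()
    return response.text
```

Inside a task you might prefer to let the exception propagate or schedule a retry rather than return None. The point is only that the call no longer waits indefinitely.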
Have you seen this page:
https://www.caktusgroup.com/blog/2013/10/30/using-strace-debug-stuck-celery-tasks/