strace fixes hung process


Question

I have a single-threaded Unix process that communicates over TCP with other processes.

The problem is the following: when I start the process, it hangs (no busy loop) until I kill it.

The funny thing is, as soon as I attach strace to it, it continues to run with the expected behavior, as if there were no problem at all. (Always reproducible.)

What could be the reason for this behavior? What effect does strace have on the state of a process?

Update: strace changed the behavior because we were using a version of openonload with a bug. As soon as we attached strace, the stack was moved back to the kernel and the problem was gone.

Recommended answer

Most likely, the strace output simply slows the process down, making deadlocks much less likely. I have seen this happen before with strace, and it can also happen when adding other debug printing or debug calls.

Deadlocks are most often seen with multi-threaded interaction, but in your case you have multiple processes. If strace frees up the processes every time, then I guess the way you open the sockets or handshake on them is what is hanging. I think buffering and blocking on the socket could be getting you into a process-deadlocked state.
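
To illustrate the buffering point, here is a minimal sketch of how a blocking send can stall when the peer is not reading. If both ends of a connection do this at the same time (each writes a large message before reading), neither makes progress. The function name `fill_send_buffer` and its parameters are made up for this example, and `socket.socketpair()` is used as a local stand-in for the TCP connection in the question; a timeout turns the stall into an observable result instead of a real hang.

```python
import socket


def fill_send_buffer(chunk_size=65536, timeout=0.2, cap=256 * 1024 * 1024):
    """Write to a socket whose peer never reads and report whether send() stalls.

    Hypothetical helper: a connected socket pair stands in for the TCP link.
    """
    writer, reader = socket.socketpair()
    writer.settimeout(timeout)  # a stalled send raises instead of hanging forever
    chunk = b"x" * chunk_size
    sent = 0
    stalled = False
    try:
        # The reader never calls recv(), so the kernel buffers eventually fill.
        while sent < cap:
            sent += writer.send(chunk)
    except socket.timeout:
        stalled = True
    finally:
        writer.close()
        reader.close()
    return sent, stalled


if __name__ == "__main__":
    sent, stalled = fill_send_buffer()
    print(f"bytes accepted before stalling: {sent}, stalled: {stalled}")
```

The number of bytes accepted before the stall depends on the operating system's socket buffer sizes, so it will differ between machines.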

Similar question, but with a multi-threaded process, where the deadlock is between threads rather than between separate processes: Using strace fixes hung memory issue

It is hard to give general examples, especially without knowing what your different processes are doing or whether they share resources in some way. I will try...

  1. Example with one object/resource which should be protected:
    One process starts making changes to an object (e.g. adding items to a list/db table).
    Another process starts iterating over the list/table.
    The danger is that the iterating loop in one of those processes gets confused and never exits, or does something worse, like writing to invalid memory.
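
The "iterating loop never exits" hazard can be shown with a deliberately simplified, single-threaded sketch: the loop body plays the role of the writer by appending to the list it is iterating over. The function name `iterate_while_growing` and the `max_steps` guard are invented for this example; in the real multi-process case the writer would be a different process and the fix would be a lock or a snapshot of the data.

```python
def iterate_while_growing(items, snapshot, max_steps=1000):
    """Iterate over `items` while appending to it on every step.

    Returns (steps, finished). Hypothetical helper: `max_steps` only exists
    so that the runaway loop can be observed without hanging the program.
    """
    # Iterating over a copy isolates the reader from concurrent growth.
    source = list(items) if snapshot else items
    steps = 0
    for item in source:
        steps += 1
        if steps >= max_steps:
            return steps, False  # gave up: the loop would never end by itself
        items.append(item)  # the "writer" grows the list mid-iteration
    return steps, True


if __name__ == "__main__":
    print(iterate_while_growing([1, 2, 3], snapshot=False))  # (1000, False)
    print(iterate_while_growing([1, 2, 3], snapshot=True))   # (3, True)
```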

  2. Example where the object/resource is protected by mutexes:
    The classic simple deadlock with two resources (simpler than dining philosophers).
    One thread/process grabs the mutex on object A and does some work.
    Another thread/process grabs the mutex on object B and does some work.
    That second thread/process needs to update object A, so it waits for the mutex on A.
    The original thread/process needs to access object B, so it waits for the mutex on B.
    . . . . . . . . . . . . @ . . . . . . . . . . .
    Silence, except for the noise of the wind and a tumbleweed blowing across the landscape.
    Deadlocked.
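
The two-mutex scenario can be reproduced with a small sketch using threads. All names here (`demo_lock_order_deadlock`, `lock_a`, `lock_b`) are made up for illustration. Barriers force the unlucky interleaving every time, and the second acquire uses a timeout so the program reports the stall instead of hanging for real.

```python
import threading


def demo_lock_order_deadlock(timeout=0.2):
    """Two workers take two locks in opposite order; report who got the second lock."""
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    both_hold_first = threading.Barrier(2)    # both workers hold their first lock
    both_tried_second = threading.Barrier(2)  # both workers finished their attempt
    results = {}

    def worker(name, first, second):
        with first:
            both_hold_first.wait()
            # A plain second.acquire() here would block forever: a real deadlock.
            got = second.acquire(timeout=timeout)
            results[name] = got
            # Keep holding `first` until the other worker has also given up.
            both_tried_second.wait()
            if got:
                second.release()

    t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
    t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results


if __name__ == "__main__":
    print(demo_lock_order_deadlock())  # neither worker obtains its second lock
```

A common way to avoid this class of deadlock is to have every thread or process acquire the locks in the same fixed order (always A before B).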
