Infinite abort() in a backtrace of a C++ program core dump


Problem description

I have a strange problem that I can't solve. Please help!

The program is a multithreaded C++ application that runs on an ARM Linux machine. I recently began testing it in long runs, and sometimes it crashes after 1-2 days like so:

*** glibc detected ** /root/client/my_program: free(): invalid pointer: 0x002a9408 ***

When I open the core dump, the main thread appears to have a corrupt stack: all I can see is an endless sequence of abort() calls.

GNU gdb (GDB) 7.3 
...
This GDB was configured as "--host=i686 --target=arm-linux".
[New LWP 706]
[New LWP 700]
[New LWP 702]
[New LWP 703]
[New LWP 704]
[New LWP 705]
Core was generated by `/root/client/my_program'.
Program terminated with signal 6, Aborted.
#0  0x001c44d4 in raise ()
(gdb) bt
#0  0x001c44d4 in raise ()
#1  0x001c47e0 in abort ()
#2  0x001c47e0 in abort ()
#3  0x001c47e0 in abort ()
#4  0x001c47e0 in abort ()
#5  0x001c47e0 in abort ()
#6  0x001c47e0 in abort ()
#7  0x001c47e0 in abort ()
#8  0x001c47e0 in abort ()
#9  0x001c47e0 in abort ()
#10 0x001c47e0 in abort ()
#11 0x001c47e0 in abort ()

And it goes on and on. I tried to get to the bottom of it by moving up the stack (frame 3000 or even higher), but eventually the core dump runs out of frames and I still can't see why this happened.

When I examine the other threads, everything seems normal there.

(gdb) info threads
  Id   Target Id         Frame 
  6    LWP 705           0x00132f04 in nanosleep ()
  5    LWP 704           0x001e7a70 in select ()
  4    LWP 703           0x00132f04 in nanosleep ()
  3    LWP 702           0x00132318 in sem_wait ()
  2    LWP 700           0x00132f04 in nanosleep ()
* 1    LWP 706           0x001c44d4 in raise ()
(gdb) thread 5
[Switching to thread 5 (LWP 704)]
#0  0x001e7a70 in select ()
(gdb) bt
#0  0x001e7a70 in select ()
#1  0x00057ad4 in CSerialPort::read (this=0xbea7d98c, string_buffer=..., delimiter=..., timeout_ms=1000) at CSerialPort.cpp:202
#2  0x00070de4 in CScanner::readResponse (this=0xbea7d4cc, resp_recv=..., timeout=1000, delim=...) at PidScanner.cpp:657
#3  0x00071198 in CScanner::sendExpect (this=0xbea7d4cc, cmd=..., exp_str=..., rcv_str=..., timeout=1000) at PidScanner.cpp:604
#4  0x00071d48 in CScanner::pollPid (this=0xbea7d4cc, mode=1, pid=12, pid_str=...) at PidScanner.cpp:525
#5  0x00072ce0 in CScanner::poll1 (this=0xbea7d4cc) 
#6  0x00074c78 in CScanner::Poll (this=0xbea7d4cc) 
#7  0x00089edc in CThread5::Thread5Poll (this=0xbea7d360) 
#8  0x0008c140 in CThread5::run (this=0xbea7d360) 
#9  0x00088698 in CThread::threadFunc (p=0xbea7d360) 
#10 0x0012e6a0 in start_thread ()
#11 0x001e90e8 in clone ()
#12 0x001e90e8 in clone ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(The class and function names are a bit weird because I changed them.) So thread #1 is where the stack is corrupt, while the backtrace of every other thread (2-6) ends with

Backtrace stopped: previous frame identical to this frame (corrupt stack?).

This happens because threads 2-6 are created in thread #1.

The thing is that I can't run the program under gdb because it runs on an embedded system, and I can't use a remote gdb server. The only option is to examine core dumps, which do not occur very often.

Could you please suggest something that could move me forward with this? (Maybe there is something else I can extract from the core dump, or some way to add hooks to the code to catch the abort() call.)

UPDATE: Basile Starynkevitch suggested using Valgrind, but it turns out that it has been ported only to ARMv7. I have an ARM926, which is ARMv5, so this won't work for me. There have been some efforts to compile Valgrind for ARMv5, though: "Valgrind cross compilation for ARMv5tel" and "valgrind on the ARM9".

UPDATE 2: I couldn't make Electric Fence work with my program, which uses C++ and pthreads. The version of Efence I got, 2.1.13, crashed in an arbitrary place after I started a thread and tried to do something more or less complicated (for example, putting a value into an STL vector). I saw people mentioning patches for Efence on the web but didn't have time to try them. I tried this on my Linux PC, not on the ARM, and other tools such as Valgrind or Dmalloc don't report any problems with the code. So anyone using version 2.1.13 of Efence should be prepared for problems with pthreads (or maybe pthreads + C++ + STL, I don't know).

Solution

My guess for the "infinite" aborts is that either abort() causes a loop (e.g. abort -> signal handler -> abort -> ...) or gdb can't correctly interpret the frames on the stack.
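The first possibility would arise if the program (or a library it links) installs a SIGABRT handler that itself ends up calling abort(). As a minimal sketch of a hook that avoids this, assuming POSIX sigaction is available on the target: the handler below only counts the signal and writes a short message with the async-signal-safe write(), then returns. The names installAbortHook, abortCount and onAbort are hypothetical and not taken from the original program.

```cpp
#include <cassert>
#include <signal.h>
#include <unistd.h>

// Number of times the hook has run; sig_atomic_t is safe to touch in a handler.
static volatile sig_atomic_t g_abortSeen = 0;

// Hypothetical hook: records the signal and returns. It deliberately does NOT
// call abort() or any non-async-signal-safe function (no printf, no malloc).
static void onAbort(int) {
    g_abortSeen = g_abortSeen + 1;
    static const char msg[] = "SIGABRT caught\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)ignored;
}

// Installs the hook for SIGABRT.
void installAbortHook() {
    struct sigaction sa = {};
    sa.sa_handler = onAbort;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    int rc = sigaction(SIGABRT, &sa, nullptr);
    assert(rc == 0);
    (void)rc;
}

// Accessor so that other code can check whether the hook fired.
int abortCount() {
    return g_abortSeen;
}
```

As documented for glibc, abort() still terminates the process after such a handler returns (it restores the default disposition and raises the signal again), so a core dump should still be produced. Whether the original program installs any SIGABRT handler at all is not known from the question.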

In either case I would suggest manually inspecting the stack of the problematic thread. If abort causes a loop, you should see a pattern, or at least the return address of abort repeating at regular intervals. You may then find the root of the problem more easily by manually skipping large parts of the (repeating) stack.
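The pattern check can be sketched offline. The example below assumes the raw stack words of thread 1 have already been copied out of the core (for instance from the output of gdb's `x/1024xw $sp`) into a vector of 32-bit values; the helper names occurrences, uniformStride and firstIndexAfter are hypothetical. It looks for every word equal to a given address, such as the 0x001c47e0 that gdb reports for abort(), checks whether the hits are evenly spaced, and reports where the repeating region ends.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Indices of every stack word equal to addr (e.g. the abort() return address).
std::vector<std::size_t> occurrences(const std::vector<std::uint32_t>& stack,
                                     std::uint32_t addr) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < stack.size(); ++i) {
        if (stack[i] == addr) hits.push_back(i);
    }
    return hits;
}

// Distance in words between consecutive hits if it is constant, otherwise 0.
// A constant, non-zero stride suggests genuinely repeated frames.
std::size_t uniformStride(const std::vector<std::size_t>& hits) {
    if (hits.size() < 2) return 0;
    const std::size_t stride = hits[1] - hits[0];
    for (std::size_t i = 2; i < hits.size(); ++i) {
        if (hits[i] - hits[i - 1] != stride) return 0;
    }
    return stride;
}

// Index of the first word after the last hit: where to resume manual inspection.
std::size_t firstIndexAfter(const std::vector<std::size_t>& hits) {
    assert(!hits.empty());
    return hits.back() + 1;
}
```

If the address never appears on the stack, or appears only once, that points towards the second explanation: gdb is misreading the frames rather than the program looping.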

Otherwise, you should find that there is no repeating pattern and, hopefully, the return address of the failing function somewhere on the stack. In the worst case such addresses have been overwritten by a buffer overflow or similar, but even then you may get lucky and recognise what they were overwritten with.
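Looking for return addresses by hand amounts to filtering the stack words for values that fall inside the program's code. A minimal sketch, again assuming the stack words have been extracted from the core into a vector and that the start and end of the text section are known (for example from gdb's `info files`); the Candidate type and the codePointers function are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A stack slot whose value looks like a code address.
struct Candidate {
    std::size_t index;    // position of the word on the dumped stack
    std::uint32_t value;  // the word itself
};

// Returns every stack word that lies in [textStart, textEnd).
// Such words are only candidates: stale data can also look like code addresses.
std::vector<Candidate> codePointers(const std::vector<std::uint32_t>& stack,
                                    std::uint32_t textStart,
                                    std::uint32_t textEnd) {
    assert(textStart < textEnd);
    std::vector<Candidate> out;
    for (std::size_t i = 0; i < stack.size(); ++i) {
        const std::uint32_t w = stack[i];
        if (w >= textStart && w < textEnd) {
            out.push_back(Candidate{i, w});
        }
    }
    return out;
}
```

Each candidate can then be resolved in gdb with `info symbol <address>`. Words that fall outside the range but look like printable ASCII are a hint that a string buffer overflowed into the saved return addresses.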
