Infinite abort() in a backtrace of a C++ program core dump


Problem description

I have a strange problem that I can't solve. Please help!

The program is a multithreaded C++ application that runs on an ARM Linux machine. Recently I began testing it over long runs, and it sometimes crashes after 1-2 days like so:

*** glibc detected ** /root/client/my_program: free(): invalid pointer: 0x002a9408 ***

When I open the core dump, the main thread seems to have a corrupt stack: all I can see is an endless run of abort() calls.

GNU gdb (GDB) 7.3 
...
This GDB was configured as "--host=i686 --target=arm-linux".
[New LWP 706]
[New LWP 700]
[New LWP 702]
[New LWP 703]
[New LWP 704]
[New LWP 705]
Core was generated by `/root/client/my_program'.
Program terminated with signal 6, Aborted.
#0  0x001c44d4 in raise ()
(gdb) bt
#0  0x001c44d4 in raise ()
#1  0x001c47e0 in abort ()
#2  0x001c47e0 in abort ()
#3  0x001c47e0 in abort ()
#4  0x001c47e0 in abort ()
#5  0x001c47e0 in abort ()
#6  0x001c47e0 in abort ()
#7  0x001c47e0 in abort ()
#8  0x001c47e0 in abort ()
#9  0x001c47e0 in abort ()
#10 0x001c47e0 in abort ()
#11 0x001c47e0 in abort ()

And it goes on and on. I tried to get to the bottom of it by moving up the stack (frame 3000 or even further), but eventually the core dump runs out of frames and I still can't see why this happened.

When I examine the other threads, everything seems normal there.

(gdb) info threads
  Id   Target Id         Frame 
  6    LWP 705           0x00132f04 in nanosleep ()
  5    LWP 704           0x001e7a70 in select ()
  4    LWP 703           0x00132f04 in nanosleep ()
  3    LWP 702           0x00132318 in sem_wait ()
  2    LWP 700           0x00132f04 in nanosleep ()
* 1    LWP 706           0x001c44d4 in raise ()
(gdb) thread 5
[Switching to thread 5 (LWP 704)]
#0  0x001e7a70 in select ()
(gdb) bt
#0  0x001e7a70 in select ()
#1  0x00057ad4 in CSerialPort::read (this=0xbea7d98c, string_buffer=..., delimiter=..., timeout_ms=1000) at CSerialPort.cpp:202
#2  0x00070de4 in CScanner::readResponse (this=0xbea7d4cc, resp_recv=..., timeout=1000, delim=...) at PidScanner.cpp:657
#3  0x00071198 in CScanner::sendExpect (this=0xbea7d4cc, cmd=..., exp_str=..., rcv_str=..., timeout=1000) at PidScanner.cpp:604
#4  0x00071d48 in CScanner::pollPid (this=0xbea7d4cc, mode=1, pid=12, pid_str=...) at PidScanner.cpp:525
#5  0x00072ce0 in CScanner::poll1 (this=0xbea7d4cc) 
#6  0x00074c78 in CScanner::Poll (this=0xbea7d4cc) 
#7  0x00089edc in CThread5::Thread5Poll (this=0xbea7d360) 
#8  0x0008c140 in CThread5::run (this=0xbea7d360) 
#9  0x00088698 in CThread::threadFunc (p=0xbea7d360) 
#10 0x0012e6a0 in start_thread ()
#11 0x001e90e8 in clone ()
#12 0x001e90e8 in clone ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(Class and function names are a bit weird because I changed them.) So, thread #1 is where the stack is corrupt; the backtrace of every other thread (2-6) ends with

Backtrace stopped: previous frame identical to this frame (corrupt stack?).

This happens because threads 2-6 are created in thread #1.

The thing is that I can't run the program under gdb because it runs on an embedded system, and I can't use a remote gdb server. The only option is to examine core dumps, which don't occur very often.

Could you please suggest something that could move me forward? (Maybe there is something else I can extract from the core dump, or some way to add hooks in the code to catch the abort() call.)

UPDATE: Basile Starynkevitch suggested using Valgrind, but it turns out it has been ported only to ARMv7. I have an ARM 926, which is ARMv5, so this won't work for me. There are some efforts to compile Valgrind for ARMv5, though: "Valgrind cross compilation for ARMv5tel" and "valgrind on the ARM9" (https://stackoverflow.com/a/6575278/4378).

UPDATE 2: I couldn't make Electric Fence work with my program. The program uses C++ and pthreads. The version of Efence I got, 2.1.13, crashed in an arbitrary place after I started a thread and tried to do something more or less complicated (for example, putting a value into an STL vector). I saw people mentioning some patches for Efence on the web but didn't have time to try them. I tried this on my Linux PC, not on the ARM, and other tools like Valgrind or Dmalloc don't report any problems with the code. So anyone using version 2.1.13 of Efence should be prepared for problems with pthreads (or maybe pthreads + C++ + STL, I don't know).

Solution

My guess for the "infinite" aborts is that either abort() causes a loop (e.g. abort -> signal handler -> abort -> ...) or gdb can't correctly interpret the frames on the stack.
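
The loop hypothesis can be ruled out, and the abort() hook asked about in the question obtained, with a SIGABRT handler that cannot re-enter itself. The following is only a minimal sketch under stated assumptions: `on_sigabrt`, `install_abort_hook` and `g_abort_seen` are hypothetical names, and `backtrace()` / `backtrace_symbols_fd()` are glibc extensions from `<execinfo.h>` that the target's C library is assumed to provide (on ARM the trace may be incomplete unless the program is built with unwind tables). `SA_RESETHAND` restores the default action as the handler is entered, so a second abort() terminates the process instead of looping.

```cpp
#include <signal.h>
#include <execinfo.h>   // backtrace(), backtrace_symbols_fd(): glibc extensions
#include <unistd.h>

// Set by the handler so the rest of the program can tell that it ran.
static volatile sig_atomic_t g_abort_seen = 0;

// Hypothetical handler: writes the aborting thread's backtrace to stderr.
// It avoids malloc(), because the heap is presumably already corrupted.
extern "C" void on_sigabrt(int)
{
    g_abort_seen = 1;

    static const char msg[] = "SIGABRT caught, backtrace follows:\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;

    void* frames[64];
    const int count = backtrace(frames, 64);
    backtrace_symbols_fd(frames, count, STDERR_FILENO);

    // Returning lets abort() raise SIGABRT again with the default action,
    // so the process still terminates and the core dump is still written.
}

// Hypothetical helper: installs the handler for the whole process.
bool install_abort_hook()
{
    // Call backtrace() once up front so that libgcc is loaded now and not
    // from inside the signal handler, where loading it could call malloc().
    void* warmup[1];
    backtrace(warmup, 1);

    struct sigaction action = {};
    sigemptyset(&action.sa_mask);
    action.sa_handler = on_sigabrt;
    action.sa_flags = SA_RESETHAND;   // one shot: default action is restored on entry
    return sigaction(SIGABRT, &action, nullptr) == 0;
}
```

Calling `install_abort_hook()` once at the top of `main()`, before any threads are created, should be enough: signal dispositions are process-wide, and the handler runs in whichever thread calls abort().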

In either case I would suggest manually inspecting the stack of the problematic thread. If abort causes a loop, you should see a pattern, or at least the return address of abort repeating every so often. Perhaps you can then find the root of the problem more easily by manually skipping large parts of the (repeating) stack.
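
The manual scan can be partly mechanised. In gdb, `x/256wx $sp` dumps raw stack words and `info symbol <address>` maps an address back to a function. The sketch below assumes those words have been copied into a vector and filters them down to values that point into the program's code and are not the repeating abort address. `Candidate` and `find_return_candidates` are hypothetical names, and the code-range bounds must come from the actual binary (for example from `info files`); nothing here is derived from the real core dump.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A stack word that looks like a code address, with its offset
// (in 32-bit words) from the start of the dumped region.
struct Candidate {
    std::size_t   index;
    std::uint32_t address;
};

// Returns the words that fall inside [text_begin, text_end) and differ from
// `repeated`, the address gdb keeps printing (0x001c47e0 in the question).
std::vector<Candidate> find_return_candidates(
    const std::vector<std::uint32_t>& stack_words,
    std::uint32_t text_begin,
    std::uint32_t text_end,
    std::uint32_t repeated)
{
    assert(text_begin < text_end);

    std::vector<Candidate> candidates;
    for (std::size_t i = 0; i < stack_words.size(); ++i) {
        const std::uint32_t word = stack_words[i];
        const bool in_text = word >= text_begin && word < text_end;
        if (in_text && word != repeated) {
            candidates.push_back(Candidate{i, word});
        }
    }
    return candidates;
}
```

Each surviving address is only a candidate: a word on the stack can fall inside the code range by coincidence, so every hit still needs to be checked against the symbol table.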

Otherwise, you should find that there is no repeating pattern and, hopefully, the return address of the failing function somewhere on the stack. In the worst case such addresses have been overwritten by a buffer overflow or similar, but perhaps you can still get lucky and recognise what they were overwritten with.
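
For that worst case, one quick check is whether the overwritten words decode as text, which would point at a string buffer overflow. The helper below is a minimal sketch: `word_as_ascii` is a hypothetical name, and it assumes a little-endian target (typical for ARM Linux, but worth confirming for the board in question).

```cpp
#include <cstdint>
#include <string>

// Interprets a 32-bit stack word as four little-endian bytes and returns
// them as text if every byte is printable ASCII. Returns an empty string
// otherwise.
std::string word_as_ascii(std::uint32_t word)
{
    std::string text;
    for (int shift = 0; shift < 32; shift += 8) {
        const unsigned char byte =
            static_cast<unsigned char>((word >> shift) & 0xFFu);
        if (byte < 0x20 || byte > 0x7E) {
            return std::string();
        }
        text.push_back(static_cast<char>(byte));
    }
    return text;
}
```

For example, a saved return address that reads 0x64636261 decodes to "abcd", whereas a genuine code address such as 0x001c47e0 yields an empty string.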
