How do I debug a difficult-to-reproduce crash with no useful call stack?

Question

I am encountering an odd crash in our software and I'm having a lot of trouble debugging it, and so I am seeking SO's advice on how to tackle it.

The crash is an access violation reading a NULL pointer:

First chance exception at $00CF0041. Exception class $C0000005 with message 'access violation at 0x00cf0041: read of address 0x00000000'.

It only happens 'sometimes' - I haven't managed to figure out any rhyme or reason, yet, for when - and only in the main thread. When it occurs, the call stack contains one incorrect entry:

For the main thread, which this is, it should show a large stack full of other items.

At this point, all other threads are inactive (mostly sitting in WaitForSingleObject or a similar function.) I have only seen this crash occur in the main thread. It always has the same call stack of one entry, in the same method at the same address. This method may or may not be related - we do use the VCL in our application. My bet, though, is that something (possibly quite a while ago) is corrupting the stack, and the address where it's crashing is effectively random. Note it has been the same address across several builds, though - it's probably not truly random.

Here is what I have tried:

  • Trying to reproduce it reliably at a certain point. I have found nothing that reproduces it every time, and a couple of things that occasionally do, or do not, for no apparent reason. These are not 'narrow' enough actions to narrow it down to a particular section of code. It may be timing related, but at the point the IDE breaks in, other threads are usually doing nothing. I can't rule out a threading problem, but think it's unlikely.
  • Building with extra debugging statements (extra debug info, extra asserts, etc.) After doing so, the crash never occurs.
  • Building with Codeguard enabled. After doing so, the crash never occurs and Codeguard shows no errors.

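My questions:

1. How do I find the code that caused the crash? How do I do the equivalent of walking back up the stack?

2. What general advice do you have for tracking down the cause of this crash?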

I am using Embarcadero RAD Studio 2010 (the project mostly contains C++ Builder code and small amounts of Delphi.)

I thought I should add what actually caused this. There was a thread that called ReadDirectoryChangesW and then, using GetOverlappedResult, waited on an event to continue and do something with the changes. The same event was also signalled, after setting a status flag, in order to terminate the thread. The problem was that when the thread exited it never called CancelIo. As a result, Windows was still tracking changes and probably still writing to the buffer when the directory changed, even though the buffer, overlapped structure and event no longer existed (nor did the thread context in which they were created.) Once the thread called CancelIo before exiting, there were no more crashes.
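The fix described above can be sketched roughly as follows. This is not the original code: `WatchContext`, `WatchThreadProc` and the field names are hypothetical, and the sketch uses a separate stop event rather than the single shared event the post describes, so that the final wait after `CancelIo` is unambiguous. The point it illustrates is that the buffer and the `OVERLAPPED` structure must stay alive until the cancelled read has actually finished.

```cpp
// Windows-only sketch; the guard only keeps the file compilable elsewhere.
#ifdef _WIN32
#include <windows.h>

// Hypothetical context shared between the watcher thread and its owner.
struct WatchContext {
    HANDLE dir;        // directory handle opened with FILE_FLAG_OVERLAPPED
    HANDLE ioEvent;    // manual-reset event, signalled when the read completes
    HANDLE stopEvent;  // manual-reset event, signalled by the owner to stop
};

DWORD WINAPI WatchThreadProc(LPVOID param)
{
    WatchContext* ctx = static_cast<WatchContext*>(param);
    DWORD buffer[4096];  // DWORD-aligned, as ReadDirectoryChangesW requires

    for (;;) {
        OVERLAPPED ov = {};
        ov.hEvent = ctx->ioEvent;

        // Starts an asynchronous read: the system may write into `buffer`
        // and `ov` at any later time until the operation completes.
        if (!ReadDirectoryChangesW(ctx->dir, buffer, sizeof(buffer), FALSE,
                                   FILE_NOTIFY_CHANGE_FILE_NAME |
                                       FILE_NOTIFY_CHANGE_LAST_WRITE,
                                   NULL, &ov, NULL)) {
            return GetLastError();
        }

        HANDLE waits[2] = { ctx->stopEvent, ctx->ioEvent };
        DWORD which = WaitForMultipleObjects(2, waits, FALSE, INFINITE);

        if (which != WAIT_OBJECT_0 + 1) {
            // Stop requested (or the wait failed): cancel the pending read,
            // then wait for it to finish before `buffer` and `ov` go away.
            CancelIo(ctx->dir);
            DWORD ignored = 0;
            GetOverlappedResult(ctx->dir, &ov, &ignored, TRUE);
            return 0;
        }

        DWORD bytes = 0;
        if (GetOverlappedResult(ctx->dir, &ov, &bytes, FALSE) && bytes > 0) {
            // Walk the FILE_NOTIFY_INFORMATION records in `buffer` here.
        }
    }
}
#endif  // _WIN32
```

Note that `CancelIo` only cancels operations issued by the calling thread, which is why the call sits inside the watcher thread itself rather than in the code that asks it to stop.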

Recommended answer

Even when the IDE-provided stack trace isn't very complete, that doesn't mean there isn't still useful information on the stack. Open up the CPU view and check out the stack pane; for every CALL opcode, a return address is pushed on the stack. Since the stack grows downwards, you'll find these return addresses above the current stack location, i.e. by scrolling upwards in the stack pane.

The stack for the main thread will be somewhere around $00120000 or $00180000 (address space randomization in Vista and upwards has made it more random). Code for the main executable will be somewhere around $00400000. You can speculatively investigate elements on the stack that don't look like integer data (low values) or stack addresses ($00120000+ range) by right-clicking on the stack entry and selecting Follow -> Near Code, which will cause the disassembly window to jump to that code address. If it looks like invalid code, it's probably not a valid entry in the stack trace. If it's valid code, it may be OS code (frequently around $77000000 and above) in which case you won't have meaningful symbols, but every so often you'll hit on an actual proper stack entry.
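The triage described above is done by eye in the CPU view, but the same heuristic can be written down as a small filter over a dump of stack words. The sketch below is only an illustration: `Layout`, `classify` and `candidateReturnAddresses` are made-up names, and the address ranges have to come from the debugger's module and thread views for the actual process. It assumes a 32-bit process, as in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Half-open address range [lo, hi).
struct Range {
    std::uint32_t lo;
    std::uint32_t hi;
};

// Hypothetical description of a 32-bit process. The real values are an
// assumption to be filled in from the debugger, not fixed constants.
struct Layout {
    Range stack;                    // main-thread stack region
    Range exeCode;                  // code of the main executable
    std::uint32_t systemCodeFloor;  // OS DLLs assumed at or above this
    std::uint32_t smallIntCeiling;  // values below this count as plain data
};

enum class WordKind { SmallInteger, StackAddress, ExeCode, SystemCode, Unknown };

bool contains(const Range& r, std::uint32_t value)
{
    assert(r.lo <= r.hi);
    return value >= r.lo && value < r.hi;
}

// Applies the answer's rules of thumb to a single stack word.
WordKind classify(std::uint32_t word, const Layout& layout)
{
    if (word < layout.smallIntCeiling) return WordKind::SmallInteger;
    if (contains(layout.stack, word)) return WordKind::StackAddress;
    if (contains(layout.exeCode, word)) return WordKind::ExeCode;
    if (word >= layout.systemCodeFloor) return WordKind::SystemCode;
    return WordKind::Unknown;
}

// Returns the indices of words that look like code addresses and are
// therefore worth a "Follow -> Near Code" check in the debugger.
std::vector<std::size_t> candidateReturnAddresses(
    const std::vector<std::uint32_t>& stackWords, const Layout& layout)
{
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < stackWords.size(); ++i) {
        WordKind kind = classify(stackWords[i], layout);
        if (kind == WordKind::ExeCode || kind == WordKind::SystemCode) {
            hits.push_back(i);
        }
    }
    return hits;
}
```

A hit is only a candidate. A value in the code range can still be stale data left over from an earlier call, so each one needs the manual check that the disassembly at that address is sensible and follows a CALL instruction.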

This technique, though somewhat laborious, can get you meaningful stack trace info when the debugger isn't able to trace things through. It doesn't help you if ESP (the stack pointer) has been screwed with, though. Fortunately, that's pretty rare.
