How do I debug a difficult-to-reproduce crash with no useful call stack?


Question

I am encountering an odd crash in our software and am having a lot of trouble debugging it, so I am seeking SO's advice on how to tackle it.

The crash is an access violation reading a NULL pointer:


First chance exception at $00CF0041. Exception class $C0000005 with message 'access violation at 0x00cf0041: read of address 0x00000000'.

It only happens 'sometimes', and only in the main thread; I haven't yet managed to figure out any rhyme or reason for when. When it occurs, the call stack contains one incorrect entry:

For the main thread, which this is, it should show a large stack full of other items.

At this point, all other threads are inactive (mostly sitting in WaitForSingleObject or a similar function). I have only seen this crash occur in the main thread. It always has the same call stack of one entry, in the same method at the same address. This method may or may not be related; we do use the VCL in our application. My bet, though, is that something (possibly quite a while ago) is corrupting the stack, and the address where it's crashing is effectively random. Note that it has been the same address across several builds, though, so it's probably not truly random.

Here is what I have tried:

  • Trying to reproduce it reliably at a certain point. I have found nothing that reproduces it every time, and a couple of things that occasionally do, or do not, for no apparent reason. These actions are not specific enough to narrow it down to a particular section of code. It may be timing related, but at the point the IDE breaks in, other threads are usually doing nothing. I can't rule out a threading problem, but think it's unlikely.
  • Building with extra debugging statements (extra debug info, extra asserts, etc.). After doing so, the crash never occurs.
  • Building with Codeguard enabled. After doing so, the crash never occurs and Codeguard shows no errors.

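My questions:

1. How do I find the code that caused the crash? How do I do the equivalent of walking back up the stack?

2. What general advice do you have for tracking down the cause of this crash?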

I am using Embarcadero RAD Studio 2010 (the project mostly contains C++ Builder code and small amounts of Delphi).

Edit: I thought I should add what actually caused this. There was a thread that called ReadDirectoryChangesW and then, using GetOverlappedResult, waited on an event to continue and do something with the changes. The event was also signalled in order to terminate the thread after setting a status flag. The problem was that when the thread exited it never called CancelIo. As a result, Windows was still tracking changes and probably still writing to the buffer when the directory changed, even though the buffer, overlapped structure and event no longer existed (nor did the thread context in which they were created). Once CancelIo was called, there were no more crashes.
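The fix described in the edit can be sketched as a scope guard that cancels the outstanding request and then waits for it to finish before the buffer and OVERLAPPED go out of scope. This is only a sketch under assumptions, not the code from the project: `PendingIoGuard` and `watch_directory` are hypothetical names, the notify filter and buffer size are arbitrary, and it needs a C++11 compiler (lambdas, `std::function`), which is newer than what C++ Builder 2010 shipped with. The Win32 calls used are `ReadDirectoryChangesW`, `CancelIo` and `GetOverlappedResult` with `bWait` set to TRUE.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical scope guard: on destruction it first cancels the outstanding
// request, then blocks until the OS has finished with it. Declare it after
// (or in a scope nested inside) the buffer it protects, so it is destroyed
// before that buffer is.
class PendingIoGuard {
public:
    PendingIoGuard(std::function<void()> cancel, std::function<void()> wait_done)
        : cancel_(std::move(cancel)), wait_done_(std::move(wait_done)) {
        assert(cancel_ && wait_done_);  // both callbacks are required
    }
    ~PendingIoGuard() {
        cancel_();     // ask the OS to abandon the request
        wait_done_();  // wait until it has really stopped using our memory
    }
    PendingIoGuard(const PendingIoGuard&) = delete;
    PendingIoGuard& operator=(const PendingIoGuard&) = delete;

private:
    std::function<void()> cancel_;
    std::function<void()> wait_done_;
};

#ifdef _WIN32
// Hypothetical watcher thread body. Assumes dir was opened with
// FILE_FLAG_OVERLAPPED and stop_event is signalled to request shutdown.
void watch_directory(HANDLE dir, HANDLE stop_event) {
    alignas(DWORD) char buffer[4096];  // written to by the OS while I/O is pending
    OVERLAPPED ov = {};
    ov.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);
    if (!ov.hEvent) return;

    if (ReadDirectoryChangesW(dir, buffer, sizeof buffer, FALSE,
                              FILE_NOTIFY_CHANGE_FILE_NAME, NULL, &ov, NULL)) {
        PendingIoGuard guard(
            [&] { CancelIo(dir); },  // same thread that issued the request
            [&] { DWORD n = 0; GetOverlappedResult(dir, &ov, &n, TRUE); });

        HANDLE handles[2] = { ov.hEvent, stop_event };
        WaitForMultipleObjects(2, handles, FALSE, INFINITE);
        // ... process the changes in buffer if ov.hEvent was the one signalled ...
    }  // guard runs here, while buffer, ov and ov.hEvent are all still alive

    CloseHandle(ov.hEvent);
}
#endif
```

The ordering is the point: cancelling alone is not enough, because the request only counts as finished once the wait returns, and only then is it safe to let the buffer and OVERLAPPED go away.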

Recommended answer

Even when the IDE-provided stack trace isn't very complete, that doesn't mean there isn't still useful information on the stack. Open up the CPU view and check out the stack pane; for every CALL opcode, a return address is pushed on the stack. Since the stack grows downwards, you'll find these return addresses above the current stack location, i.e. by scrolling upwards in the stack pane.

The stack for the main thread will be somewhere around $00120000 or $00180000 (address space randomization in Vista and upwards has made it more random). Code for the main executable will be somewhere around $00400000. You can speculatively investigate elements on the stack that don't look like integer data (low values) or stack addresses ($00120000+ range) by right-clicking on the stack entry and selecting Follow -> Near Code, which will cause the disassembly window to jump to that code address. If it looks like invalid code, it's probably not a valid entry in the stack trace. If it's valid code, it may be OS code (frequently around $77000000 and above), in which case you won't have meaningful symbols, but every so often you'll hit on an actual proper stack entry.

This technique, though somewhat laborious, can get you meaningful stack trace info when the debugger isn't able to trace things through. It doesn't help you if ESP (the stack pointer) has been corrupted, though. Fortunately, that's pretty rare.
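The manual scan can be roughly approximated in code if you have a raw dump of the stack words. The following is a minimal sketch, assuming a 32-bit process and code ranges that you supply yourself (for example the EXE's image range); `AddressRange`, `Candidate` and `find_return_address_candidates` are hypothetical names, not part of any debugger API. Small integers and stack addresses fall outside the code ranges and are skipped, and every hit is only a candidate: you still need to check in the disassembly that the instruction just before it is a CALL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A half-open range of addresses [begin, end) that is known to hold code.
struct AddressRange {
    std::uint32_t begin;
    std::uint32_t end;
    bool contains(std::uint32_t a) const { return a >= begin && a < end; }
};

// One stack word that points into a known code range.
struct Candidate {
    std::size_t slot;       // index in the dump (0 = the word at ESP)
    std::uint32_t address;  // value stored in that slot
};

// Scans a raw dump of 32-bit stack words, lowest address first, and keeps
// the words that land inside any of the supplied code ranges. Everything
// else (small integers, stack addresses, heap pointers) is ignored.
std::vector<Candidate> find_return_address_candidates(
    const std::vector<std::uint32_t>& stack_words,
    const std::vector<AddressRange>& code_ranges)
{
    std::vector<Candidate> out;
    for (std::size_t i = 0; i < stack_words.size(); ++i) {
        for (std::size_t r = 0; r < code_ranges.size(); ++r) {
            if (code_ranges[r].contains(stack_words[i])) {
                Candidate c = { i, stack_words[i] };
                out.push_back(c);
                break;  // one match is enough for this slot
            }
        }
    }
    return out;
}
```

Because the dump starts at ESP and goes upwards, the candidates come out in the same order as a call stack listing: the most recent caller first.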
