"Unexplainable" core dump


Problem description

I've seen many core dumps in my life, but this one has me stumped.

Context:

  • multi-threaded Linux/x86_64 program running on a cluster of AMD Barcelona CPUs
  • the code that crashes is executed a lot
  • running 1000 instances of the program (the exact same optimized binary) under load produces 1-2 crashes per hour
  • the crashes happen on different machines (but the machines themselves are nearly identical)
  • the crashes all look the same (same exact address, same call stack)
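
To put the crash rate in perspective, here is a rough back-of-the-envelope sketch. It assumes the load is spread evenly across the 1000 instances, which the numbers above do not actually guarantee; the constant and function names are made up for illustration.

```python
# Figures taken from the context list above.
INSTANCES = 1000
CRASHES_PER_HOUR = (1, 2)  # observed range across the whole fleet


def instance_hours_per_crash(instances, crashes_per_hour):
    """Instance-hours of runtime per crash, assuming evenly spread load."""
    return instances / crashes_per_hour


# The higher crash rate gives the lower bound, and vice versa.
low = instance_hours_per_crash(INSTANCES, CRASHES_PER_HOUR[1])
high = instance_hours_per_crash(INSTANCES, CRASHES_PER_HOUR[0])
print(f"about one crash per {low:.0f} to {high:.0f} instance-hours")
```

In other words, a single instance would be expected to run for weeks before hitting this, which is part of why it is so hard to reproduce on one machine.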

Here are the details of the crash:

Program terminated with signal 11, Segmentation fault.
#0  0x00000000017bd9fd in Foo()
(gdb) x/i $pc
=> 0x17bd9fd <_Z3Foov+349>: rex.RB orb $0x8d,(%r15)

(gdb) x/6i $pc-12
0x17bd9f1 <_Z3Foov+337>:    mov    (%rbx),%eax
0x17bd9f3 <_Z3Foov+339>:    mov    %rbx,%rdi
0x17bd9f6 <_Z3Foov+342>:    callq  *0x70(%rax)
0x17bd9f9 <_Z3Foov+345>:    cmp    %eax,%r12d
0x17bd9fc <_Z3Foov+348>:    mov    %eax,-0x80(%rbp)
0x17bd9ff <_Z3Foov+351>:    jge    0x17bd97e <_Z3Foov+222>

You'll notice that the crash happened in the middle of the instruction at 0x17bd9fc, shortly after the return from the virtual function call at 0x17bd9f6.

When I examine the virtual table, I see that it is not corrupted in any way:

(gdb) x/a $rbx
0x2ab094951f80: 0x3f8c550 <_ZTI4Foo1+16>
(gdb) x/a 0x3f8c550+0x70
0x3f8c5c0 <_ZTI4Foo1+128>:  0x2d3d7b0 <_ZN4Foo13GetEv>

and that it points to this trivial function (as expected by looking at the source):

(gdb) disas 0x2d3d7b0
Dump of assembler code for function _ZN4Foo13GetEv:
   0x0000000002d3d7b0 <+0>: push   %rbp
   0x0000000002d3d7b1 <+1>: mov    0x70(%rdi),%eax
   0x0000000002d3d7b4 <+4>: mov    %rsp,%rbp
   0x0000000002d3d7b7 <+7>: leaveq 
   0x0000000002d3d7b8 <+8>: retq   
End of assembler dump.
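
The vtable lookup above can be double-checked with a little arithmetic. This is only a sketch using the values printed by gdb; the variable names are made up, and the slot index assumes 8-byte vtable entries, as is normal on x86_64.

```python
# Values copied from the gdb session above.
VPTR = 0x3F8C550          # vtable pointer read from the object at $rbx
VPTR_SYMBOL_OFFSET = 16   # gdb prints it as <_ZTI4Foo1+16>
CALL_DISPLACEMENT = 0x70  # from `callq *0x70(%rax)`
POINTER_SIZE = 8          # assumption: 8-byte vtable slots on x86_64

# Address of the vtable slot that the indirect call reads.
slot_address = VPTR + CALL_DISPLACEMENT

# Offset of that slot relative to the symbol, as gdb would print it.
slot_symbol_offset = VPTR_SYMBOL_OFFSET + CALL_DISPLACEMENT

# Which entry of the vtable this is, counted from the vtable pointer.
slot_index = CALL_DISPLACEMENT // POINTER_SIZE

print(hex(slot_address), slot_symbol_offset, slot_index)
```

The computed slot address and symbol offset match the `0x3f8c5c0 <_ZTI4Foo1+128>` line that gdb printed, so the call at 0x17bd9f6 really does read the slot that holds `_ZN4Foo13GetEv`.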

Further, when I look at the return address that Foo1::Get() should have returned to:

(gdb) x/a $rsp-8
0x2afa55602048: 0x17bd9f9 <_Z3Foov+345>

I see that it points to the right instruction, so it's as if during the return from Foo1::Get(), some gremlin came along and incremented %rip by 4.
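
The "middle of an instruction" and "off by 4" observations can be verified mechanically from the disassembly. The following is a small sketch, not a general tool: the address list is copied by hand from the `x/6i` output above, and the helper name is hypothetical.

```python
from bisect import bisect_right

# Instruction start addresses, copied from the x/6i output above.
INSTRUCTION_STARTS = [
    0x17BD9F1,  # mov (%rbx),%eax
    0x17BD9F3,  # mov %rbx,%rdi
    0x17BD9F6,  # callq *0x70(%rax)
    0x17BD9F9,  # cmp %eax,%r12d
    0x17BD9FC,  # mov %eax,-0x80(%rbp)
    0x17BD9FF,  # jge 0x17bd97e
]
FAULTING_PC = 0x17BD9FD     # from frame #0
RETURN_ADDRESS = 0x17BD9F9  # from x/a $rsp-8


def containing_instruction(starts, pc):
    """Return the start address of the instruction whose bytes contain pc."""
    index = bisect_right(starts, pc) - 1
    if index < 0:
        raise ValueError("pc lies before the first known instruction")
    return starts[index]


start = containing_instruction(INSTRUCTION_STARTS, FAULTING_PC)
offset_into_instruction = FAULTING_PC - start
rip_skew = FAULTING_PC - RETURN_ADDRESS

print(hex(start), offset_into_instruction, rip_skew)
```

The faulting address lands one byte into the `mov` at 0x17bd9fc, and four bytes past the return address saved on the stack, which is why gdb decodes a nonsensical `rex.RB orb` at $pc.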

Plausible explanations?

Solution

So, unlikely as it may seem, we appear to have hit an actual bona-fide CPU bug.

http://support.amd.com/us/Processor_TechDocs/41322_10h_Rev_Gd.pdf has erratum #721:

721 Processor May Incorrectly Update Stack Pointer

Description

Under a highly specific and detailed set of internal timing conditions,
the processor may incorrectly update the stack pointer after a long series
of push and/or near-call instructions, or a long series of pop 
and/or near-return instructions. The processor must be in 64-bit mode for
this erratum to occur.

Potential Effect on System

The stack pointer value jumps by a value of approximately 1024, either in
the positive or negative direction.
This incorrect stack pointer causes unpredictable program or system behavior,
usually observed as a program exception or crash (for example, a #GP or #UD).
