"Unexplainable" core dump

Views: 19 | Published: 2022/1/12 16:12:37 | Tags: linux, segmentation-fault, x86-64


Problem description


I've seen many core dumps in my life, but this one has me stumped.

Context:

multi-threaded Linux/x86_64 program running on a cluster of AMD Barcelona CPUs
the code that crashes is executed a lot
running 1000 instances of the program (the exact same optimized binary) under load produces 1-2 crashes per hour
the crashes happen on different machines (but the machines themselves are nearly identical)
the crashes all look the same (same exact address, same call stack)

Here are the details of the crash:

Program terminated with signal 11, Segmentation fault.
#0  0x00000000017bd9fd in Foo()
(gdb) x/i $pc
=> 0x17bd9fd <_Z3Foov+349>: rex.RB orb $0x8d,(%r15)

(gdb) x/6i $pc-12
0x17bd9f1 <_Z3Foov+337>:    mov    (%rbx),%eax
0x17bd9f3 <_Z3Foov+339>:    mov    %rbx,%rdi
0x17bd9f6 <_Z3Foov+342>:    callq  *0x70(%rax)
0x17bd9f9 <_Z3Foov+345>:    cmp    %eax,%r12d
0x17bd9fc <_Z3Foov+348>:    mov    %eax,-0x80(%rbp)
0x17bd9ff <_Z3Foov+351>:    jge    0x17bd97e <_Z3Foov+222>

You'll notice that the crash happened in the middle of the instruction at 0x17bd9fc, which comes right after the return from the call at 0x17bd9f6 to a virtual function.
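
The odd-looking instruction at the crash address can be reproduced by decoding the same code bytes one byte late. The sketch below is only an illustration: the dump shows no raw bytes, so the byte values are an assumption, inferred from the standard x86-64 encodings of `mov %eax,-0x80(%rbp)` (89 45 80) and the `0f 8d` opcode of a `jge` with a 32-bit displacement. The helper `decode_group1_imm8` is hypothetical and handles only this one instruction form.

```python
# Assumed bytes starting at 0x17bd9fc (inferred from standard encodings,
# not taken from the core dump itself).
MOV_ADDR = 0x17BD9FC
CODE = bytes([
    0x89, 0x45, 0x80,  # mov %eax,-0x80(%rbp)
    0x0F, 0x8D,        # first two opcode bytes of jge rel32
])
CRASH_PC = 0x17BD9FD

GROUP1_OPS = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
          "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_group1_imm8(pc):
    """Decode 'REX 80 /r ib' with a plain (%reg) memory operand, AT&T style."""
    rex, opcode, modrm, imm = CODE[pc - MOV_ADDR:][:4]
    if rex & 0xF0 != 0x40 or opcode != 0x80:
        raise ValueError("not a REX-prefixed 0x80 group-1 instruction")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm in (4, 5):
        raise ValueError("only the simple (%reg) addressing form is handled")
    # REX bits are W, R, X, B from bit 3 down to bit 0.
    flags = "".join(name for bit, name in zip((8, 4, 2, 1), "WRXB") if rex & bit)
    base = REGS64[rm | ((rex & 1) << 3)]  # REX.B extends the base register
    return f"rex.{flags} {GROUP1_OPS[reg]}b ${imm:#x},(%{base})"


decoded = decode_group1_imm8(CRASH_PC)
print(decoded)
```

Under the assumed bytes, the decoded text matches what gdb prints at `$pc`, which is consistent with execution having resumed one byte into the `mov`.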

When I examine the virtual table, I see that it is not corrupted in any way:

(gdb) x/a $rbx
0x2ab094951f80: 0x3f8c550 <_ZTI4Foo1+16>
(gdb) x/a 0x3f8c550+0x70
0x3f8c5c0 <_ZTI4Foo1+128>:  0x2d3d7b0 <_ZN4Foo13GetEv>

and that it points to this trivial function (as expected by looking at the source):

(gdb) disas 0x2d3d7b0
Dump of assembler code for function _ZN4Foo13GetEv:
   0x0000000002d3d7b0 <+0>: push   %rbp
   0x0000000002d3d7b1 <+1>: mov    0x70(%rdi),%eax
   0x0000000002d3d7b4 <+4>: mov    %rsp,%rbp
   0x0000000002d3d7b7 <+7>: leaveq 
   0x0000000002d3d7b8 <+8>: retq   
End of assembler dump.
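
The vtable check above boils down to two small calculations, sketched here with the addresses from the gdb session. The 8-byte slot size is the x86-64 pointer width; the variable names are illustrative and do not come from the original program.

```python
VPTR = 0x3F8C550      # value gdb shows at (%rbx): the object's vtable pointer
CALL_DISP = 0x70      # displacement in 'callq *0x70(%rax)'
POINTER_SIZE = 8      # bytes per vtable slot on x86-64

slot_addr = VPTR + CALL_DISP            # address of the function pointer used
slot_index = CALL_DISP // POINTER_SIZE  # slot number, counted from VPTR

print(hex(slot_addr))
print(slot_index)
```

The computed slot address is the one examined with `x/a 0x3f8c550+0x70`, so the call target is read from the slot that holds `Foo1::Get()`.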

Further, when I look at the return address that Foo1::Get() should have returned to:

(gdb) x/a $rsp-8
0x2afa55602048: 0x17bd9f9 <_Z3Foov+345>

I see that it points to the right instruction, so it's as if during the return from Foo1::Get(), some gremlin came along and incremented %rip by 4.
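
The "incremented by 4" observation can be checked mechanically against the listing. This is a minimal sketch using only the addresses shown in the gdb output; the helper `locate` is a made-up name, and it assumes the address passed in falls within the six listed instructions.

```python
import bisect

# Instruction start addresses from the 'x/6i $pc-12' listing.
INSN_STARTS = [0x17BD9F1, 0x17BD9F3, 0x17BD9F6, 0x17BD9F9, 0x17BD9FC, 0x17BD9FF]
RETURN_ADDR = 0x17BD9F9  # value stored at $rsp-8
CRASH_PC = 0x17BD9FD     # faulting $pc


def locate(pc):
    """Return (start of the listed instruction containing pc, offset into it)."""
    i = bisect.bisect_right(INSN_STARTS, pc) - 1
    if i < 0:
        raise ValueError("pc is before the first listed instruction")
    return INSN_STARTS[i], pc - INSN_STARTS[i]


skew = CRASH_PC - RETURN_ADDR
insn_start, offset = locate(CRASH_PC)
print(skew, hex(insn_start), offset)
```

The return address sits exactly on an instruction boundary, while the faulting address lies 4 bytes past it and 1 byte into the `mov` at 0x17bd9fc.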

Plausible explanations?

Solution

So, unlikely as it may seem, we appear to have hit an actual bona-fide CPU bug.

http://support.amd.com/us/Processor_TechDocs/41322_10h_Rev_Gd.pdf has erratum #721:

721 Processor May Incorrectly Update Stack Pointer

Description

Under a highly specific and detailed set of internal timing conditions,
the processor may incorrectly update the stack pointer after a long series
of push and/or near-call instructions, or a long series of pop 
and/or near-return instructions. The processor must be in 64-bit mode for
this erratum to occur.

Potential Effect on System

The stack pointer value jumps by a value of approximately 1024, either in
the positive or negative direction.
This incorrect stack pointer causes unpredictable program or system behavior,
usually observed as a program exception or crash (for example, a #GP or #UD).
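
When chasing a suspected erratum across a cluster, a first step is often just finding which hosts carry the CPU family in question. A minimal sketch, assuming Linux's `/proc/cpuinfo` layout, where AMD Family 10h shows up as `cpu family : 16`. The function name `is_amd_family_10h` and the sample text are made up for illustration, and a match says nothing about whether a particular revision is affected; that information is in the revision guide.

```python
def is_amd_family_10h(cpuinfo_text):
    """True if the first processor block reports vendor AMD and family 16 (0x10)."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return (fields.get("vendor_id") == "AuthenticAMD"
            and fields.get("cpu family") == "16")


# Made-up sample in /proc/cpuinfo layout; on a real host, pass
# open("/proc/cpuinfo").read() instead.
SAMPLE = "processor\t: 0\nvendor_id\t: AuthenticAMD\ncpu family\t: 16\nmodel\t\t: 2\n"
print(is_amd_family_10h(SAMPLE))
```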
