"Unexplainable" core dump
Question
I've seen many core dumps in my life, but this one has me stumped.
The context:
- multi-threaded Linux/x86_64 program running on a cluster of AMD Barcelona CPUs
- the code that crashes is executed a lot
- running 1000 instances of the program (the exact same optimized binary) under load produces 1-2 crashes per hour
- the crashes happen on different machines (but the machines themselves are pretty identical)
- the crashes all look the same (same exact address, same call stack)
Here are the details of the crash:
Program terminated with signal 11, Segmentation fault.
#0 0x00000000017bd9fd in Foo()
(gdb) x/i $pc
=> 0x17bd9fd <_Z3Foov+349>: rex.RB orb $0x8d,(%r15)
(gdb) x/6i $pc-12
0x17bd9f1 <_Z3Foov+337>: mov (%rbx),%eax
0x17bd9f3 <_Z3Foov+339>: mov %rbx,%rdi
0x17bd9f6 <_Z3Foov+342>: callq *0x70(%rax)
0x17bd9f9 <_Z3Foov+345>: cmp %eax,%r12d
0x17bd9fc <_Z3Foov+348>: mov %eax,-0x80(%rbp)
0x17bd9ff <_Z3Foov+351>: jge 0x17bd97e <_Z3Foov+222>
You'll notice that the crash happened in the middle of the instruction at 0x17bd9fc, which executes right after the return from the call at 0x17bd9f6 to a virtual function.
When I examine the virtual table, I see that it is not corrupted in any way:
(gdb) x/a $rbx
0x2ab094951f80: 0x3f8c550 <_ZTI4Foo1+16>
(gdb) x/a 0x3f8c550+0x70
0x3f8c5c0 <_ZTI4Foo1+128>: 0x2d3d7b0 <_ZN4Foo13GetEv>
and that it points to this trivial function (as expected from the source):
(gdb) disas 0x2d3d7b0
Dump of assembler code for function _ZN4Foo13GetEv:
0x0000000002d3d7b0 <+0>: push %rbp
0x0000000002d3d7b1 <+1>: mov 0x70(%rdi),%eax
0x0000000002d3d7b4 <+4>: mov %rsp,%rbp
0x0000000002d3d7b7 <+7>: leaveq
0x0000000002d3d7b8 <+8>: retq
End of assembler dump.
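For readers who prefer source to disassembly, here is a minimal sketch of what the C++ behind this call might look like. It is an assumption, not the original code: only the names Foo1 and Get appear in the dump, so Base, value_ and call_get are hypothetical, and the real class layout (a member at offset 0x70, a vtable slot at offset 0x70) is not reproduced.

```cpp
#include <cassert>

// Hypothetical base class; the dump does not show the real hierarchy.
struct Base {
    virtual ~Base() = default;
    virtual int Get() = 0;
};

// Stand-in for Foo1: Get() is a trivial getter, which is why the
// disassembly above is just a single load followed by a return.
struct Foo1 : Base {
    explicit Foo1(int v) : value_(v) {}
    int Get() override { return value_; }
    int value_;  // hypothetical member name
};

// Stand-in for the call site in Foo(): load the vtable pointer from the
// object, then make an indirect call through the slot that holds Get().
int call_get(Base& obj) {
    return obj.Get();
}
```

A getter like this has no loops, no locals and no writes to memory, so it is hard to see how the callee itself could corrupt the return path.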
Further, when I look at the return address that Foo1::Get() should have returned to:
(gdb) x/a $rsp-8
0x2afa55602048: 0x17bd9f9 <_Z3Foov+345>
I see that it points to the right instruction, so it's as if, during the return from Foo1::Get(), some gremlin came along and incremented %rip by 4.
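The arithmetic behind that claim can be checked mechanically. The sketch below uses only addresses copied from the gdb output above; the helper names containing_insn and is_insn_boundary are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Instruction start addresses copied from the `x/6i $pc-12` listing.
constexpr std::uint64_t kInsnStarts[] = {
    0x17bd9f1, 0x17bd9f3, 0x17bd9f6, 0x17bd9f9, 0x17bd9fc, 0x17bd9ff,
};
constexpr std::size_t kInsnCount = sizeof(kInsnStarts) / sizeof(kInsnStarts[0]);

constexpr std::uint64_t kCrashPc = 0x17bd9fd;     // $pc in frame #0
constexpr std::uint64_t kReturnAddr = 0x17bd9f9;  // value found at $rsp-8

// Hypothetical helper: start address of the listed instruction whose bytes
// contain `pc`, or 0 if `pc` is outside the listing. The length of the last
// listed instruction is unknown here, so it is never matched.
std::uint64_t containing_insn(std::uint64_t pc) {
    for (std::size_t i = 0; i + 1 < kInsnCount; ++i) {
        if (kInsnStarts[i] <= pc && pc < kInsnStarts[i + 1]) {
            return kInsnStarts[i];
        }
    }
    return 0;
}

// Hypothetical helper: true when `pc` is the first byte of a listed instruction.
bool is_insn_boundary(std::uint64_t pc) {
    for (std::size_t i = 0; i < kInsnCount; ++i) {
        if (kInsnStarts[i] == pc) {
            return true;
        }
    }
    return false;
}
```

With these values, containing_insn(kCrashPc) is 0x17bd9fc, is_insn_boundary(kCrashPc) is false, and kCrashPc - kReturnAddr is 4: the saved return address is a valid instruction boundary, while the faulting %rip sits one byte into the following mov.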
Any plausible explanations?
Recommended answer
So, unlikely as it may seem, we appear to have hit an actual bona-fide CPU bug.
http://support.amd.com/us/Processor_TechDocs/41322_10h_Rev_Gd.pdf has erratum #721:
721 Processor May Incorrectly Update Stack Pointer
Description
Under a highly specific and detailed set of internal timing conditions,
the processor may incorrectly update the stack pointer after a long series
of push and/or near-call instructions, or a long series of pop
and/or near-return instructions. The processor must be in 64-bit mode for
this erratum to occur.
Potential Effect on System
The stack pointer value jumps by a value of approximately 1024, either in
the positive or negative direction.
This incorrect stack pointer causes unpredictable program or system behavior,
usually observed as a program exception or crash (for example, a #GP or #UD).