Debugging a nasty SIGILL crash: Text Segment corruption


Problem description


Ours is a PowerPC-based embedded system running Linux. We are encountering a random SIGILL crash that affects a wide variety of applications. The root cause of the crash is that the instruction about to be executed has been zeroed out, which indicates corruption of the text segment in memory. Because the text segment is mapped read-only, the application itself should not be able to corrupt it, so I suspect that some common subsystem (DMA?) is causing the corruption. The problem takes days to reproduce, which makes it difficult to investigate. To begin with, I want to be able to tell if and when the text segment of any application has been corrupted. I have looked at the stack trace, and all the pointers and registers look correct.
Do you have any suggestions on how I can go about this?
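
One way to tell if and when a binary's in-memory copy has changed is to compare it page by page against digests taken from a known-good copy. The sketch below is an illustration, not something from the original post: the names `page_digests`, `corrupted_pages` and `scan_file` are hypothetical, the 4096-byte page size is an assumption, and it relies on text pages being ordinary page-cache pages, so that a plain read of the file returns the same in-memory copy while it stays cached.

```python
import hashlib

PAGE_SIZE = 4096  # assumption; check the real page size on the target


def page_digests(data, page_size=PAGE_SIZE):
    """Return one SHA-1 hex digest per page-sized chunk of data."""
    return [
        hashlib.sha1(data[i:i + page_size]).hexdigest()
        for i in range(0, len(data), page_size)
    ]


def corrupted_pages(reference, current):
    """Return the indexes of pages whose digest differs from the reference."""
    return [i for i, (r, c) in enumerate(zip(reference, current)) if r != c]


def scan_file(path, reference, page_size=PAGE_SIZE):
    """Read path through the page cache and return file offsets of changed pages."""
    with open(path, "rb") as f:
        data = f.read()
    current = page_digests(data, page_size)
    return [i * page_size for i in corrupted_pages(reference, current)]
```

The reference digests would come from a trusted copy (for example on the build host), and `scan_file` would be run periodically on the target against each executable and shared library. A non-empty result gives the file offsets of the pages that changed. A page that has been evicted is re-read from storage, so this can only catch corruption while the page is still resident.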

Some Info:
Linux 3.12.19-rt30 #1 SMP Fri Mar 11 01:31:24 IST 2016 ppc64 GNU/Linux

(gdb) bt
0 0x10457dc0 in xxx

Disassembly output:
=> 0x10457dc0 <+80>: mr r1,r11
0x10457dc4 <+84>: blr

Instruction expected at address 0x10457dc0: 0x7d615b78
Instruction found after catching SIGILL 0x10457dc0: 0x00000000

(gdb) maintenance info sections
0x10006c60->0x106cecac at 0x00006c60: .text ALLOC LOAD READONLY CODE HAS_CONTENTS

Expected (from the application binary):
(gdb) x /32 0x10457da0
0x10457da0 : 0x913e0000 0x4bff4f5d 0x397f0020 0x800b0004
0x10457db0 : 0x83abfff4 0x83cbfff8 0x7c0803a6 0x83ebfffc
0x10457dc0 : 0x7d615b78 0x4e800020 0x7c7d1b78 0x7fc3f378
0x10457dd0 : 0x4bcd8be5 0x7fa3eb78 0x4857e109 0x9421fff0

Actual (after handling SIGILL and dumping nearby memory locations):
Faulting instruction address: 0x10457dc0
0x10457da0 : 0x913E0000
0x10457db0 : 0x83ABFFF4
=> 0x10457dc0 : 0x00000000
0x10457dd0 : 0x4BCD8BE5
0x10457de0 : 0x93E1000C

Edit:
One lead we have is that the corruption always occurs at an address ending in 0xdc0.
For example:
Faulting instruction address: 0x10653dc0 << printed by our application after catching SIGILL
Faulting instruction address: 0x1000ddc0 << printed by our application after catching SIGILL
flash_erase[8557]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
nandwrite[8561]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
awk[4448]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
awk[16002]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
getStats[20670]: unhandled signal 4 at 0fecfdc0 nip 0fecfdc0 lr 0fecfdbc code 30001
expr[27923]: unhandled signal 4 at 0fe74dc0 nip 0fe74dc0 lr 0fe74dc0 code 30001
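
The pattern can be checked mechanically by reducing each faulting address to its offset within a page. This is a minimal sketch, assuming a 4096-byte page and the kernel message format shown above; `FAULT_RE` and `page_offsets` are names made up for this example.

```python
import re

PAGE_SIZE = 4096  # assumption, matching the page size assumed in the question

# Matches kernel messages such as:
#   awk[4448]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
FAULT_RE = re.compile(
    r"(?P<comm>[^\[\]\s]+)\[(?P<pid>\d+)\]: unhandled signal (?P<sig>\d+) "
    r"at [0-9a-f]+ nip (?P<nip>[0-9a-f]+)"
)


def page_offsets(log_lines):
    """Return (comm, pid, offset within page) for every signal 4 (SIGILL) line."""
    result = []
    for line in log_lines:
        m = FAULT_RE.search(line)
        if m and m.group("sig") == "4":
            nip = int(m.group("nip"), 16)
            result.append((m.group("comm"), int(m.group("pid")), nip % PAGE_SIZE))
    return result
```

Feeding it the log lines above yields an offset of 0xdc0 for every entry, which is what points at a single location within one page rather than at any particular program.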

Edit 2: Another lead is that the corruption always occurs in physical frame number 0x00a4d. Assuming a PAGE_SIZE of 4096, this translates to physical address 0x00A4DDC0. We suspect a couple of our kernel drivers and are investigating further. Is there a better approach (such as setting a hardware watchpoint) that would be more efficient? How about KASAN, as suggested below?
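
The arithmetic, and one way a frame number can be looked up from user space, can be sketched as follows. The helper names are hypothetical, the page size of 4096 is the assumption made above, and the lookup assumes the kernel exposes `/proc/<pid>/pagemap` with the documented layout (PFN in bits 0-54, "present" flag in bit 63) and that the caller has permission to read it.

```python
import struct

PAGE_SIZE = 4096           # assumption from the question
PFN_MASK = (1 << 55) - 1   # bits 0-54 of a pagemap entry hold the PFN
PRESENT_BIT = 1 << 63      # bit 63 is set when the page is present in RAM


def phys_addr(pfn, page_offset, page_size=PAGE_SIZE):
    """Combine a page frame number and an in-page offset into a physical address."""
    return pfn * page_size + page_offset


def decode_pagemap_entry(raw8):
    """Decode one 8-byte pagemap entry; return the PFN, or None if not present."""
    (entry,) = struct.unpack("=Q", raw8)  # entries are in native byte order
    if not entry & PRESENT_BIT:
        return None
    return entry & PFN_MASK


def virt_to_pfn(pid, vaddr, page_size=PAGE_SIZE):
    """Look up the PFN backing vaddr in process pid via /proc/<pid>/pagemap."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * 8)  # one 8-byte entry per virtual page
        raw = f.read(8)
    return decode_pagemap_entry(raw) if len(raw) == 8 else None
```

With these definitions, `hex(phys_addr(0x00a4d, 0xdc0))` gives `0xa4ddc0`, matching the address above. If the target actually uses a different page size, both the frame number and the offset would have to be recomputed.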

Any help is appreciated. Thanks.

Solution

1.) The text segment is read-only, but its permissions can be changed with mprotect. Check for that if you think it is possible.

2.) If it is a kernel problem:

  • Run the kernel with the KASAN and UBSAN (undefined behaviour) sanitizers enabled
  • Focus on driver code that is not in mainline
  • The hint here is that the corruption is tiny (the dump shows a single zeroed instruction word). Maybe I'm wrong, but that suggests DMA is not to blame. It looks more like some kind of invalid store.

3.) Hardware. I think your problem looks like a hardware problem (a RAM issue).

  • You can try lowering the RAM frequency in the bootloader
  • Check whether the problem reproduces on stable mainline software. If it does, that points to the hardware rather than your own code.
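
For point 1, a quick way to look for text that has been made writable is to scan `/proc/<pid>/maps` for mappings that are both writable and executable. This is only a sketch: `writable_exec_mappings` is a hypothetical helper, and it shows the permissions at the moment of the scan, not a past mprotect call that has since been reverted.

```python
def writable_exec_mappings(maps_text):
    """Return (address range, perms, path) for mappings that are both w and x."""
    hits = []
    for line in maps_text.splitlines():
        # Fields: address range, perms, offset, device, inode, optional path
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        perms = fields[1]
        if "w" in perms and "x" in perms:
            path = fields[5].strip() if len(fields) > 5 else "[anonymous]"
            hits.append((fields[0], perms, path))
    return hits
```

It would be called with the contents of `/proc/<pid>/maps` for each process of interest. Hits are only leads, since some toolchains legitimately produce writable and executable mappings. Also note that a write to a private file mapping triggers copy-on-write, so it would change only that one process's copy.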
