Why are segfaults called faults (and not aborts) if they are not recoverable?

Question

My understanding of the terminology is this:

1) An interrupt is "a notification" initiated by the hardware to call the OS to run its handlers.

2) A trap is "a notification" initiated by the software to call the OS to run its handlers.

3) A fault is an exception raised by the processor if an error has occurred but it is recoverable.

4) An abort is an exception raised by the processor if an error has occurred but it is non-recoverable.

Why do we call it a segmentation fault and not a segmentation abort, then?

A segmentation fault is when your program attempts to access memory that has either not been assigned to it by the operating system, or that it is otherwise not allowed to access.

My experience (primarily while testing C code) is that any time a program throws a segmentation fault, it's back to the drawing board. Is there a scenario where the programmer can actually "catch" the exception and do something useful with it?

Answer

At a CPU level, modern OSes don't use x86 segment limits for memory protection. (In fact, they couldn't even if they wanted to in long mode (x86-64); the segment base is fixed at 0 and the limit at -1.)

OSes use virtual memory page tables, so the real CPU exception on an out-of-bounds memory access is a page fault.

x86 manuals call this a #PF(fault-code) exception; e.g. see the list of exceptions that add can raise. Fun fact: the x86 exception for an access outside of a segment limit is #GP(0).

It's up to the OS's page-fault handler to decide how to handle it. Many #PF exceptions happen as part of normal operation:

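  • a copy-on-write mapping was written: copy the page and mark it writable in the page table, then return to user-space to retry the instruction that faulted. (This is a kind of "soft" aka "minor" page fault.)
  • other soft page faults, e.g. the kernel was lazy and didn't actually update the page table to reflect a mapping the process made. (e.g. mmap(2) without MAP_POPULATE).
  • hard page fault: find some physical memory and read the data from disk (from the file for a file mapping, or from the swap file / partition for anonymous pages).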

After sorting out any of the above, the kernel updates the page table that the CPU reads on its own, and invalidates that TLB entry if necessary (e.g. valid but read-only changed to valid + read-write).
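The soft faults described above can be observed from user space by comparing the process's minor-fault counter before and after the first touch of freshly mapped memory. This is only a sketch, assuming Linux or a similar POSIX system that provides anonymous mmap and getrusage; the helper names minor_faults and faults_from_first_touch are made up for illustration, and the exact count depends on the kernel (transparent huge pages, for instance, can lower it).

```c
#define _GNU_SOURCE             /* glibc: expose MAP_ANONYMOUS */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Current count of minor (soft) page faults taken by this process. */
static long minor_faults(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/* Map `pages` anonymous pages without MAP_POPULATE, write one byte to each,
   and return the change in the minor-fault counter, or -1 on error. */
long faults_from_first_touch(size_t pages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = pages * page;
    char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    long before = minor_faults();
    for (size_t i = 0; i < pages; i++)
        ((volatile char *)mem)[i * page] = 1;   /* first write: page is populated now */
    long after = minor_faults();

    munmap(mem, len);
    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

Calling faults_from_first_touch(64) from main and printing the result usually shows a positive number: each of those faults was a #PF that the kernel resolved silently, with no signal delivered to the process.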

Only if the kernel finds that the process really doesn't logically have anything mapped at that address (or that it's a write to a read-only mapping) will the kernel deliver a SIGSEGV to the process. This is purely a software thing, after sorting out the cause of the hardware exception.

The English text for SIGSEGV (from strsignal(3)) is "Segmentation fault" on all Unix/Linux systems, so that's what's printed (by the shell) when a child process dies from that signal.
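Roughly what the shell does here can be sketched as follows: wait for the child, notice that it was killed by a signal rather than exiting, and look up a description of that signal. This assumes a POSIX system where strsignal is available; the helper names are hypothetical, and the child uses raise to send itself the same signal the kernel would deliver.

```c
#define _GNU_SOURCE             /* glibc: expose strsignal under strict -std modes */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that kills itself with SIGSEGV. Returns the number of the
   signal that terminated it, or -1 on error or normal exit. */
int signal_that_killed_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        signal(SIGSEGV, SIG_DFL);   /* make sure the default action applies */
        raise(SIGSEGV);             /* same signal the kernel would deliver */
        _exit(0);                   /* not reached */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}

/* Print a shell-like message describing how the child died. */
void report_child_death(void)
{
    int sig = signal_that_killed_child();
    if (sig > 0)
        printf("child killed by signal %d: %s\n", sig, strsignal(sig));
}
```

The exact wording returned by strsignal is up to the C library, so the message printed by report_child_death may differ slightly between systems.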

This term is well understood, so it has stuck around even though it mostly exists only for historical reasons and the hardware doesn't use segmentation.

Note that you also get a SIGSEGV for stuff like trying to execute privileged instructions in user-space (like wbinvd or wrmsr (write model-specific register)). At a CPU level, the x86 exception is #GP(0) for privileged instructions when you're not in ring 0 (kernel mode).

The same goes for misaligned SSE instructions (like movaps), although some Unixes on other platforms send SIGBUS for misaligned-access faults (e.g. Solaris on SPARC).

Why do we call it a segmentation fault and not a segmentation abort then?

It is recoverable. It doesn't crash the whole machine / kernel; it just means that a user-space process tried to do something that the kernel doesn't allow.

Even for the process that segfaulted, it can be recoverable. This is why it's a catchable signal, unlike SIGKILL. Usually you can't just resume execution, but you can usefully record where the fault was (e.g. print a precise exception error message and even a stack backtrace).
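A minimal sketch of "recording where the fault was": a handler installed with sigaction and SA_SIGINFO receives the faulting address in si_addr. The names segv_handler and run_faulting_child, the exit code 42, and the message format are all arbitrary choices for this example, and it assumes a POSIX system with anonymous mmap. The fault is provoked in a forked child so the calling process survives.

```c
#define _GNU_SOURCE             /* glibc: expose MAP_ANONYMOUS */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Report the faulting address, then exit. Only async-signal-safe calls
   (write, _exit) are used; printf is not safe inside a signal handler. */
static void segv_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    char msg[] = "fault at address 0x0000000000000000\n";
    uintptr_t addr = (uintptr_t)info->si_addr;
    char *digit = msg + sizeof msg - 3;            /* last hex digit */
    for (int i = 0; i < 16; i++, addr >>= 4)
        *digit-- = "0123456789abcdef"[addr & 0xf];
    ssize_t n = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)n;
    _exit(42);   /* returning would just re-run the faulting instruction */
}

/* Fork a child that installs the handler and touches a PROT_NONE page.
   Returns the child's exit status (42 if the handler ran), or -1 on error. */
int run_faulting_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
        sigaction(SIGBUS, &sa, NULL);    /* some systems report this as SIGBUS */

        volatile char *page = mmap(NULL, 4096, PROT_NONE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
            _exit(1);
        page[0] = 1;                     /* not permitted: raises the signal */
        _exit(0);                        /* not reached */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The handler exits instead of returning because nothing has been fixed: returning would restart the same store and fault again.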

The signal handler for SIGSEGV could longjmp or whatever. Or, if the SIGSEGV was expected, it could modify the code or the pointer used for the load before returning from the signal handler. (e.g. for a Meltdown exploit, although there are much more efficient techniques that do the chained loads in the shadow of a mispredict or something else that suppresses the exception, instead of actually letting the CPU raise an exception and catching the SIGSEGV the kernel delivers.)
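The longjmp approach could look something like the sketch below, which uses sigsetjmp and siglongjmp so that the signal mask is restored after jumping out of the handler. The function names can_read_byte and probe_demo are invented for this example. It assumes a POSIX system, is not thread-safe (one global jump buffer), and relies on a volatile access so the compiler actually performs the load.

```c
#define _GNU_SOURCE             /* glibc: expose MAP_ANONYMOUS */
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>

static sigjmp_buf probe_env;

static void probe_handler(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);   /* abandon the faulting access */
}

/* Returns 1 if one byte at addr can be read, 0 if the read raised
   SIGSEGV or SIGBUS. */
int can_read_byte(const volatile char *addr)
{
    struct sigaction sa, old_segv, old_bus;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = probe_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    int ok;
    if (sigsetjmp(probe_env, 1) == 0) {   /* 1: save the signal mask too */
        char c = *addr;                   /* volatile read: may fault */
        (void)c;
        ok = 1;
    } else {
        ok = 0;                           /* arrived here via siglongjmp */
    }

    sigaction(SIGSEGV, &old_segv, NULL);  /* restore previous handlers */
    sigaction(SIGBUS, &old_bus, NULL);
    return ok;
}

/* Demo: a stack byte is readable, a PROT_NONE page is not.
   Returns 1 if both probes behave as expected. */
int probe_demo(void)
{
    char local = 'x';
    void *guard = mmap(NULL, 4096, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (guard == MAP_FAILED)
        return 0;
    int ok = can_read_byte(&local) == 1 && can_read_byte(guard) == 0;
    munmap(guard, 4096);
    return ok;
}
```

This works here only because the faulting access is a single, isolated volatile load; jumping out of arbitrary code that was in the middle of updating data structures is a different matter.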

Most programming languages (other than assembly) aren't low-level enough to give well-defined behaviour when optimizing around an access that might segfault, in a way that would let you write a handler that recovers. This is why you usually don't do anything more than print an error message (and maybe a stack backtrace) in a SIGSEGV handler, if you install one at all.

Some JIT compilers for sandboxed languages (like JavaScript) use hardware memory access checks to eliminate NULL pointer checks. In the normal case there's no fault, so it doesn't matter how slow the faulting case is.

A JVM can turn a SIGSEGV received by one of its threads into a NullPointerException for the Java code it's running, without any problems for the JVM itself.

  • Effective Null Pointer Check Elimination Utilizing Hardware Trap: a research paper on this for Java, from three IBM scientists.
  • SableVM: 6.2.4 Hardware Support on Various Architectures, about NULL pointer checks.

A further trick is to put the end of an array at the end of a page (followed by a large-enough unmapped region), so bounds-checking on every access is done for free by the hardware. If you can statically prove the index is always positive, and that it can't be larger than 32 bits, you're all set.
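A rough sketch of the layout, assuming a POSIX system with anonymous mmap and mprotect. The names alloc_guarded_ints and overrun_signal are hypothetical, and a single guard page is used for brevity, so only small overruns are caught; the "large-enough unmapped region" from the text would need a correspondingly larger PROT_NONE range. The overrun is done in a forked child so the caller can observe which signal killed it.

```c
#define _GNU_SOURCE             /* glibc: expose MAP_ANONYMOUS */
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Allocate n ints so that the array ends exactly at a page boundary, with
   one inaccessible guard page right after it. Returns NULL on failure.
   (Sketch only: there is no matching free function.) */
int *alloc_guarded_ints(size_t n)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t bytes = n * sizeof(int);
    size_t data_pages = (bytes + page - 1) / page;
    size_t total = (data_pages + 1) * page;

    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    char *guard = base + data_pages * page;
    if (mprotect(guard, page, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }
    return (int *)(guard - bytes);   /* last element ends where the guard begins */
}

/* Fork a child that writes one element past the end of an n-element array
   (n >= 1). Returns the signal that killed the child, 0 if it exited
   normally, or -1 on error. */
int overrun_signal(size_t n)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        signal(SIGSEGV, SIG_DFL);
        volatile int *a = alloc_guarded_ints(n);
        if (!a)
            _exit(1);
        a[n - 1] = 1;     /* in bounds: fine */
        a[n] = 2;         /* one past the end: lands on the guard page */
        _exit(0);         /* not reached if the guard page works */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

No explicit bounds check appears anywhere in the access path; the MMU does it. A JIT that uses this trick would install a handler for the signal rather than letting the process die.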

I don't think there's standard terminology to make that distinction. It depends what kind of recovery you're talking about. Obviously the OS can keep running after anything user-space can make the hardware do, otherwise unprivileged user-space could crash the machine.

Related: in "When an interrupt occurs, what happens to instructions in the pipeline?", Andy Glew (a CPU architect who worked on Intel's P6 microarchitecture) says "trap" is basically any interrupt that's caused by the code that's running (rather than by an external signal), and happens synchronously. (e.g. when a faulting instruction reaches the retirement stage of the pipeline without an earlier branch-mispredict or other exception being detected first).

"Abort" isn't standard CPU-architecture terminology. Like I said, you want the OS to be able to continue no matter what, and only hardware failure or kernel bugs normally prevent that.

AFAIK, "abort" is not very standard operating-systems terminology either. Unix has signals, and some of them are uncatchable (like SIGKILL and SIGSTOP), but most can be caught.

SIGABRT can be caught by a signal handler. The process exits if the handler returns, so if you don't want that you can longjmp out of it. But AFAIK no error condition raises SIGABRT; it's only sent manually by software, e.g. by calling the abort() library function. (It often results in a stack backtrace.)
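As an illustration of jumping out of a SIGABRT handler, the following sketch calls abort() and continues afterwards. The name survive_abort is made up for this example; it assumes a POSIX system and uses sigsetjmp/siglongjmp rather than plain setjmp/longjmp so the signal mask is restored. It is meant to show the mechanism, not to recommend it as an error-handling strategy.

```c
#define _GNU_SOURCE             /* glibc: expose sigaction/sigsetjmp under strict -std modes */
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

static sigjmp_buf abort_env;

static void abrt_handler(int sig)
{
    (void)sig;
    siglongjmp(abort_env, 1);    /* never return into abort() */
}

/* Call abort() with a SIGABRT handler that jumps back out.
   Returns 1 if execution continued after abort(), 0 on setup failure. */
int survive_abort(void)
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = abrt_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGABRT, &sa, &old) != 0)
        return 0;

    if (sigsetjmp(abort_env, 1) == 0)
        abort();                 /* raises SIGABRT; the handler jumps back */

    sigaction(SIGABRT, &old, NULL);   /* restore the previous disposition */
    return 1;
}
```

If abrt_handler returned normally instead of jumping, abort() would go on to terminate the process, which is the behaviour described above.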

If you look at x86 manuals or the exception table on the osdev wiki, these terms do have specific meanings in this context (thanks to @MargaretBloom for the descriptions):

  • trap: raised after an instruction has successfully completed; the return address points after the trapping instruction. #DB debug and #OF overflow (into) exceptions are traps. (Some sources of #DB are faults instead.) int 0x80 and other software-interrupt instructions are also traps, as is syscall (but it puts the return address in rcx instead of pushing it; syscall is not an exception, and thus not really a trap in this sense).

  • fault: raised after an attempted execution is made and then rolled back; the return address points to the faulting instruction. (Most exception types are faults.)

  • abort: the return address points to an unrelated location (i.e. for #DF double-fault and #MC machine-check). Triple fault can't be handled; it's what happens when the CPU hits an exception trying to run the double-fault handler, and really does stop the whole CPU.

Note that even Intel CPU architects like Andy Glew sometimes use the term "trap" more generally, I think meaning any synchronous exception, when discussing computer-architecture theory. Don't expect people to stick to the above terminology unless you're actually talking about handling specific exceptions on x86, although it is useful and sensible terminology, and you could use it in other contexts. If you want to make the distinction, you should clarify what you mean by each term so everyone's on the same page.
