Why are segfaults called faults (and not aborts) if they are not recoverable?

Question

My understanding of the terminology is as follows:

1) An interrupt is "a notification" that is initiated by the hardware to call the OS to run its handlers.

2) A trap is "a notification" that is initiated by the software to call the OS to run its handlers.

3) A fault is an exception that is raised by the processor if an error has occurred but it is recoverable.

4) An abort is an exception that is raised by the processor if an error has occurred but it is non-recoverable.

Why do we call it a segmentation fault and not a segmentation abort then?

A segmentation fault occurs when your program attempts to access memory that it has not been assigned by the operating system, or that it is otherwise not allowed to access.

My experience (primarily while testing C code) is that any time a program throws a segmentation fault, it is back to the drawing board. Is there a scenario where the programmer can actually "catch" the exception and do something useful with it?

Recommended answer

At a CPU level, modern OSes don't use x86 segment limits for memory protection. (In fact, in long mode (x86-64) they couldn't even if they wanted to: the segment base is fixed at 0 and the limit at -1.)

OSes use virtual memory page tables, so the real CPU exception on an out-of-bounds memory access is a page fault.

x86 manuals call this a #PF(fault-code) exception, e.g. see the list of exceptions that the add instruction can raise. Fun fact: the x86 exception for an access outside of a segment limit is #GP(0).

It's up to the OS's page-fault handler to decide how to handle it. Many #PF exceptions happen as part of normal operation:

  • copy-on-write mapping got written: copy the page and mark it writeable in the page table, then return to user-space to retry the instruction that faulted.
  • soft page fault: the kernel was lazy and didn't actually have the page table updated to reflect the mappings the process made. (e.g. mmap(2) without MAP_POPULATE).
  • hard page fault: find some physical memory and read the file from disk (a file mapping or from swap file/partition for anonymous pages).

After sorting out any of the above, the kernel updates the page table (which the CPU reads on its own) and invalidates that TLB entry if necessary (e.g. when an entry that was valid but read-only changes to valid + read-write).

Only if the kernel finds that the process really doesn't logically have anything mapped to that address (or that it's a write to a read-only mapping) will the kernel deliver a SIGSEGV to the process. This is purely a software thing, after sorting out the cause of the hardware exception.

The English text for SIGSEGV (from strsignal(3)) is "Segmentation fault" on all Unix/Linux systems, so that's what's printed (by the shell) when a child process dies from that signal.

This term is well understood, so it is still used even though it mostly exists only for historical reasons and the hardware doesn't use segmentation.

Note that you also get a SIGSEGV for stuff like trying to execute privileged instructions in user-space (like wbinvd or wrmsr (write model-specific register)). At a CPU level, the x86 exception is #GP(0) for privileged instructions when you're not in ring 0 (kernel mode).

The same goes for misaligned SSE instructions (like movaps), although some Unixes on other platforms send SIGBUS for misaligned-access faults (e.g. Solaris on SPARC).

Why do we call it a segmentation fault and not a segmentation abort then?

It is recoverable. It doesn't crash the whole machine / kernel; it just means that a user-space process tried to do something that the kernel doesn't allow.

Even for the process that segfaulted, it can be recoverable. This is why it's a catchable signal, unlike SIGKILL. Usually you can't just resume execution, but you can usefully record where the fault was (e.g. print a precise exception error message and even a stack backtrace).

The signal handler for SIGSEGV could longjmp or whatever. Or if the SIGSEGV was expected, then modify the code or the pointer used for the load, before returning from the signal handler. (e.g. for a Meltdown exploit, although there are much more efficient techniques that do the chained loads in the shadow of a mispredict or something else that suppresses the exception, instead of actually letting the CPU raise an exception and catching the SIGSEGV the kernel delivers)

Most programming languages (other than assembly) aren't low-level enough to give well defined behaviour when optimizing around an access that might segfault in a way that would let you write a handler that recovers. This is why usually you don't do anything more than print an error message (and maybe a stack backtrace) in a SIGSEGV handler if you install one at all.

Some JIT compilers for sandboxed languages (like JavaScript) use hardware memory access checks to eliminate NULL pointer checks. In the normal case there's no fault, so it doesn't matter how slow the faulting case is.

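A Java JVM can turn a SIGSEGV received by one of its threads into a NullPointerException for the Java code it is running, without any problems for the JVM itself.

  • Effective Null Pointer Check Elimination Utilizing Hardware Trap: a research paper on this for Java, from three IBM scientists.
  • SableVM: 6.2.4 Hardware Support on Various Architectures, about NULL pointer checks.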

A further trick is to put the end of an array at the end of a page (followed by a large-enough unmapped region), so bounds-checking on every access is done for free by the hardware. If you can statically prove that the index is never negative and that it fits in 32 bits, you're all set.

I don't think there's standard terminology to make that distinction. It depends what kind of recovery you're talking about. Obviously the OS can keep running after anything user-space can make the hardware do, otherwise unprivileged user-space could crash the machine.

Related: on "When an interrupt occurs, what happens to instructions in the pipeline?", Andy Glew (a CPU architect who worked on Intel's P6 microarchitecture) says "trap" is basically any interrupt that's caused by the code that's running (rather than by an external signal) and that happens synchronously (e.g. when a faulting instruction reaches the retirement stage of the pipeline without an earlier branch mispredict or other exception being detected first).

"Abort" isn't standard CPU-architecture terminology. Like I said, you want the OS to be able to continue no matter what, and only hardware failure or kernel bugs normally prevent that.

AFAIK, "abort" is not very standard operating-systems terminology either. Unix has signals, and some of them are uncatchable (like SIGKILL and SIGSTOP), but most can be caught.

SIGABRT can be caught by a signal handler. The process exits if the handler returns, so if you don't want that you can longjmp out of it. But AFAIK no error condition raises SIGABRT; it's only sent manually by software, e.g. by calling the abort() library function. (It often results in a stack backtrace.)

If you look at x86 manuals or this exception table on the osdev wiki, there are specific meanings in this context (thanks to @MargaretBloom for the descriptions):

  • trap: raised after an instruction successfully completes; the return address points after the trapping instruction. #DB debug and #OF overflow (into) exceptions are traps. (Some sources of #DB are faults instead.) int 0x80 and other software-interrupt instructions are also traps, as is syscall (but it puts the return address in rcx instead of pushing it; syscall is not an exception, and thus not really a trap in this sense).
  • fault: raised after an attempted execution is made and then rolled back; the return address points to the faulting instruction. (Most exception types are faults.)
  • abort: raised when the return address points to an unrelated location (i.e. for #DF double-fault and #MC machine-check). A triple fault can't be handled; it's what happens when the CPU hits an exception while trying to run the double-fault handler, and it really does stop the whole CPU.

Note that even Intel CPU architects like Andy Glew sometimes use the term "trap" more generally, I think meaning any synchronous exception, when discussing computer-architecture theory. Don't expect people to stick to the above terminology unless you're actually talking about handling specific exceptions on x86. It is useful and sensible terminology, though, and you could use it in other contexts. But if you want to make the distinction, you should clarify what you mean by each term so everyone's on the same page.
