Bounds check in 64-bit hardware


Problem description


I was reading a blog post about the 64-bit edition of Firefox on hacks.mozilla.org.

The author states:

For asm.js code, the increased address space also lets us use hardware memory protection to safely remove bounds checks from asm.js heap accesses. The gains are pretty dramatic: 8%-17% on the asmjs-apps-*-throughput tests as reported on arewefastyet.com.

I was trying to understand how 64-bit hardware can provide automatic bounds checking (assuming the compiler does it with hardware support) for C/C++. I could not find any answers on SO. I found one technical paper on this subject, but I am not able to grasp how this is done.

Can someone explain how 64-bit hardware aids in bounds checking?

Solution

Most modern CPUs implement virtual addressing/virtual memory - when a program references a particular address, that address is virtual; the mapping to a physical page, if any, is implemented by the CPU's MMU (memory management unit). The CPU translates every virtual address to a physical address by looking it up in the page table the OS set up for the current process. These lookups are cached by the TLB, so most of the time there's no extra delay. (In some non-x86 CPU designs, TLB misses are handled in software by the OS.)

So my program accesses address 0x8050, which is in virtual page 8 (assuming the standard 4096 byte (0x1000) page size). The CPU sees that virtual page 8 is mapped to physical page 200, and so performs a read at physical address 200 * 4096 + 0x50 == 0xC8050. (Just as the TLB caches page table lookups, the more familiar L1/L2/L3 caches cache accesses to physical RAM.)

What happens when the CPU does not have a TLB mapping for that virtual address? Such a thing occurs frequently because the TLB is of limited size. The answer is that the CPU generates a page fault, which is handled by the OS.

Several outcomes can occur as a result of a page fault:

  • One, the OS can say "oh, well it just wasn't in the TLB because I couldn't fit it". The OS evicts an entry from the TLB and stuffs in the new entry using the process's page table map, and then lets the process keep running. This happens thousands of times per second on moderately loaded machines. (On CPUs with hardware TLB miss handling, like x86, this case is handled in hardware, and is not even a "minor" page fault.)
  • Two, the OS can say "oh, well that virtual page isn't mapped right now because the physical page it was using was swapped to disk because I ran out of memory". The OS suspends the process, finds some memory to use (perhaps by swapping out some other virtual mapping), queues a disk read for the requested physical memory, and when the disk read completes, resumes the process with the freshly filled page table mapping. (This is a "major" page fault.)
  • Three, the process is trying to access memory for which no mapping exists - it's reading memory it shouldn't be. This is commonly called a segmentation fault.

The relevant case is number 3. When a segfault happens, the default behavior of the operating system is to abort the process and do things like write out a core file. However, a process is allowed to trap its own segfaults and attempt to handle them, perhaps even without stopping. This is where things get interesting.

We can use this to our advantage to perform 'hardware accelerated' index checks, but there are a few more stumbling blocks we hit trying to do so.

First, the general idea: for every array, we put it in its own virtual memory region, with all of the pages that contain the array data being mapped as usual. On either side of the real array data, we create virtual page mappings that are unreadable and unwritable. If you attempt to read outside of the array, you'll generate a page fault. The compiler inserts its own page fault handler when it builds the program, and that handler turns the page fault into an index-out-of-bounds exception.

Stumbling block number one is that we can only mark whole pages as being readable or not. Array sizes may not be an even multiple of a page size, so we have a problem - we can't put fences exactly before and after the end of the array. The best we can do is leave a small gap either before the beginning of the array or after the end of the array between the array and the nearest 'fence' page.

How do they get around this? Well, in Java's case, it's not easy to compile code that performs negative indexing; and if it does, it doesn't matter anyway because the negative index is treated like it's unsigned, which puts the index far ahead of the beginning of the array, which means that it's very likely to hit unmapped memory and will cause a fault anyway.

So what they do is to align the array so that the end of the array butts up right against the end of a page, like so ('-' means unmapped, '+' means mapped):

-----------++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
|  Page 1  |  Page 2  |  Page 3  |  Page 4  |  Page 5  |  Page 6  |  Page 7  | ...
                 |----------------array---------------------------|

Now, if the index is past the end of the array, it'll hit page 7, which is unmapped, which will cause a page fault, which will turn into an index-out-of-bounds exception. If the index is before the beginning of the array (that is, it's negative), then because it's treated as an unsigned value, it'll become very large and positive, putting us far past page 7, again causing an unmapped memory read, causing a page fault, which will again turn into an index-out-of-bounds exception.

Stumbling block number two is that we really should leave a lot of unmapped virtual memory past the end of the array before we map the next object; otherwise, if an index is out of bounds, but far, far out of bounds, it might hit a valid page and not cause an index-out-of-bounds exception, instead reading or writing arbitrary memory.

In order to solve this, we just use huge amounts of virtual memory - we put each array into its own 4 GiB region of memory, of which only the first N pages are actually mapped. We can do this because we're just using address space here, not actual physical memory. A 64-bit process has ~4 billion chunks of 4 GiB regions of memory, so we have plenty of address space to work with before we run out. On a 32-bit CPU or process, we have very little address space to play with, so this technique isn't very feasible. As it is, many 32-bit programs today are running out of virtual address space just trying to access real memory, never mind trying to map empty 'fence' pages in that space to use as 'hardware accelerated' index range checks.
