Address canonical form and pointer arithmetic

Question

On AMD64-compliant architectures, addresses must be in canonical form before they are dereferenced.

From the Intel manual, section 3.3.7.1:

In 64-bit mode, an address is considered to be in canonical form if address bits 63 through to the most-significant implemented bit by the microarchitecture are set to either all ones or all zeros.

Now, the most significant implemented bit on current operating systems and architectures is the 47th bit. This leaves us with a 48-bit address space.
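As a minimal sketch of the rule quoted above, the check can be written as a comparison of the upper bits. The name `is_canonical` and the `implemented_bits` parameter are hypothetical, chosen only for illustration, and the default of 48 assumes the 48-bit case described here.

```cpp
#include <cstdint>

// Hypothetical helper: true if bits 63 down to (implemented_bits - 1)
// are all zeros or all ones. Assumes 1 < implemented_bits < 64.
constexpr bool is_canonical(std::uint64_t addr, unsigned implemented_bits = 48)
{
    // Shifting right by (implemented_bits - 1) keeps the top implemented bit
    // and every bit above it; those bits must all be equal.
    return (addr >> (implemented_bits - 1)) == 0 ||
           (addr >> (implemented_bits - 1)) ==
               (~std::uint64_t{0} >> (implemented_bits - 1));
}
```

With 48 implemented bits, `is_canonical(0x00007FFFFFFFFFFC)` and `is_canonical(0xFFFF800000000000)` hold, while `is_canonical(0x0000800000000000)` does not.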

Especially when ASLR is enabled, user programs can expect to receive an address with the 47th bit set.

If optimizations such as pointer tagging are used and the upper bits are used to store information, the program must make sure the 48th to 63rd bits are set back to whatever the 47th bit was before dereferencing the address.
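A rough sketch of what such tagging and untagging could look like, operating on the address as a 64-bit integer. The helper names (`with_tag`, `tag_of`, `strip_tag`) are hypothetical, and the layout assumes 48 implemented address bits with a 16-bit tag in bits 48 to 63.

```cpp
#include <cstdint>

// Assumption: 48 implemented address bits, so bits 48..63 are free for a tag.
constexpr std::uint64_t kAddrMask = (std::uint64_t{1} << 48) - 1;  // bits 0..47
constexpr std::uint64_t kBit47    = std::uint64_t{1} << 47;

// Hypothetical helper: store a 16-bit tag in bits 48..63 of an address.
constexpr std::uint64_t with_tag(std::uint64_t addr, std::uint16_t tag)
{
    return (addr & kAddrMask) | (static_cast<std::uint64_t>(tag) << 48);
}

// Hypothetical helper: read the tag back out of a tagged value.
constexpr std::uint16_t tag_of(std::uint64_t tagged)
{
    return static_cast<std::uint16_t>(tagged >> 48);
}

// Hypothetical helper: discard the tag and copy bit 47 into bits 48..63,
// which restores the canonical form of the address.
constexpr std::uint64_t strip_tag(std::uint64_t tagged)
{
    return (tagged & kBit47) ? (tagged | ~kAddrMask) : (tagged & kAddrMask);
}
```

A real pointer would be converted with `reinterpret_cast<std::uintptr_t>` before tagging and cast back after `strip_tag`. The sign extension is written with masks rather than a signed right shift so that it does not rely on implementation-defined behavior.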

But consider this code:

int main()
{
    int* intArray = new int[100];

    int* it = intArray;

    // Fill the array with any value.
    for (int i = 0; i < 100; i++)
    {
        *it = 20;
        it++;   
    }

    delete [] intArray;
    return 0;
}

Now consider that intArray is, say:

0000 0000 0000 0000 0111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1100

After setting it to intArray and incrementing it once, and given sizeof(int) == 4, it becomes:

0000 0000 0000 0000 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

Bit 47 is the only bit set in this value. What happens here is that the second pointer produced by pointer arithmetic is invalid because it is not in canonical form. The correct address should be:

1111 1111 1111 1111 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

How do programs deal with this? Is there a guarantee by the OS that you will never be allocated memory whose address range does not vary by the 47th bit?

Recommended answer

The canonical address rules mean there is a giant hole in the 64-bit virtual address space. 2^47-1 is not contiguous with the next valid address above it, so a single mmap won't include any of the unusable range of 64-bit addresses.

+----------+
| 2^64-1   |   0xffffffffffffffff
| ...      |
| 2^64-2^47|   0xffff800000000000
+----------+
|          |
| unusable |      not to scale: this part is 2^16 times as large
|          |
+----------+
| 2^47-1   |   0x00007fffffffffff
| ...      |
| 0        |   0x0000000000000000
+----------+
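The three regions in the diagram can be expressed as a small classifier. This is only an illustrative sketch: `Region` and `classify` are hypothetical names, and the boundaries assume 48 implemented bits as drawn above.

```cpp
#include <cstdint>

enum class Region { LowHalf, Hole, HighHalf };

// Hypothetical helper mirroring the diagram, assuming 48 implemented bits.
constexpr Region classify(std::uint64_t addr)
{
    return addr <= 0x00007FFFFFFFFFFFULL ? Region::LowHalf    // 0 .. 2^47-1
         : addr >= 0xFFFF800000000000ULL ? Region::HighHalf   // 2^64-2^47 .. 2^64-1
                                         : Region::Hole;      // unusable
}
```

The question's starting address 0x00007FFFFFFFFFFC classifies as `Region::LowHalf`, while the incremented value 0x0000800000000000 falls in `Region::Hole`, so no single mapping can contain both.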

Also, most kernels reserve the high half of the canonical range for their own use (see, e.g., x86-64 Linux's memory map). User space can only allocate in the contiguous low range anyway, so the existence of the gap is irrelevant.

Is there a guarantee by the OS that you will never be allocated memory whose address range does not vary by the 47th bit?

Not exactly. The 48-bit address space supported by current hardware is an implementation detail. The canonical-address rules ensure that future systems can support more virtual address bits without breaking backwards compatibility to any significant degree.

At most, you'd just need a compat flag to have the OS not give the process any memory regions whose high bits are not all the same (like Linux's current MAP_32BIT flag for mmap, or a process-wide setting). That could support programs that used the high bits for tags and manually redid sign-extension.

Future hardware won't need to support any kind of flag to ignore high address bits or not, because junk in the high bits is currently an error. Intel 5-level paging adds another 9 virtual address bits, widening the canonical high and low halves (see Intel's white paper).
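To make the effect of those extra 9 bits concrete, here is a small sketch that computes the edges of the two canonical halves for a given number of implemented virtual address bits. The function names are hypothetical. The value 57 is simply 48 + 9.

```cpp
#include <cstdint>

// Highest address of the canonical low half for a given number of
// implemented virtual address bits (assumes 1 < bits < 64).
constexpr std::uint64_t low_half_max(unsigned bits)
{
    return (std::uint64_t{1} << (bits - 1)) - 1;
}

// Lowest address of the canonical high half: every bit from (bits - 1)
// upward is set, every bit below is clear.
constexpr std::uint64_t high_half_min(unsigned bits)
{
    return ~low_half_max(bits);
}
```

For 48 bits this gives 0x00007FFFFFFFFFFF and 0xFFFF800000000000. For 57 bits it gives 0x00FFFFFFFFFFFFFF and 0xFF00000000000000, so an address such as 0x0000800000000000, which is non-canonical today, would sit inside the widened low half.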

See also: Why is the virtual address in 64-bit mode 4 bits shorter (48 bits long) than the physical address (52 bits long)?

Fun fact: Linux defaults to mapping the stack at the top of the lower range of valid addresses. (Related: Why does Linux favor 0x7f mappings?)

$ gdb /bin/ls
...
(gdb) b _start
Function "_start" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (_start) pending.
(gdb) r
Starting program: /bin/ls

Breakpoint 1, 0x00007ffff7dd9cd0 in _start () from /lib64/ld-linux-x86-64.so.2
(gdb) p $rsp
$1 = (void *) 0x7fffffffd850
(gdb) exit

$ calc
2^47-1
              0x7fffffffffff

(Modern GDB can use starti to break before the first user-space instruction executes instead of messing around with breakpoint commands.)
