Is there hardware support for 128-bit integers in modern processors?


Question


Do we still need to emulate 128bit integers in software, or is there hardware support for them in your average desktop processor these days?

Solution

I'm going to explain this by comparing desktop processors to simple microcontrollers, because the arithmetic logic units (ALUs), the calculators inside the CPU, operate similarly, and by comparing the Microsoft x64 calling convention with the System V calling convention. For the short answer, scroll to the end; the long answer is that it's easiest to see the difference by comparing x86/x64 to ARM and AVR:

Long Answer

Native Double Word Integer Multiply Architecture Support Comparison

CPU                | word x word => dword            | dword x dword => dword
M0                 | No (only 32x32 => 32)           | No
AVR                | 8x8 => 16 (some versions only)  | No
M3/M4/A            | Yes (32x32 => 64)               | No
x86/x64            | Yes (up to 64x64 => 128)        | Yes (up to 64x64 => 64 for x64)
SSE/SSE2/AVX/AVX2  | Yes (32x32 => 64 SIMD elements) | No (at most 32x32 => 32 SIMD elements)

If you understand this chart, skip to Short Answer

CPUs in smartphones, PCs, and servers have multiple ALUs that perform calculations on registers of various widths. Microcontrollers, on the other hand, usually have only one ALU. The word size of the CPU is not necessarily the same as the word size of the ALU, though they may be equal; the Cortex-M0 is a prime example.

ARM Architecture

The Cortex-M0 is a Thumb-2 processor, which is a compact (mostly 16-bit) instruction encoding for a 32-bit architecture (register and ALU width). The Cortex-M3/M4 have some more instructions, including smull / umull, 32x32 => 64-bit widening multiplies that are helpful for extended precision. Despite these differences, all ARM CPUs share the same set of architectural registers, which makes it easy to upgrade from the M0 to the M3/M4 and to faster Cortex-A series smartphone processors with NEON SIMD extensions.
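
As a concrete illustration (a minimal sketch of mine, not from the original answer), a widening multiply written in C++ like the following maps to a single umull on a Cortex-M3/M4, while on a Cortex-M0 the compiler has to synthesize it from narrower multiplies:

    #include <cstdint>

    // Widening 32x32 => 64-bit multiply: a single umull/smull on Cortex-M3/M4,
    // but a multi-instruction software sequence on Cortex-M0.
    uint64_t widening_mul(uint32_t a, uint32_t b) {
        return (uint64_t)a * (uint64_t)b;
    }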

[Figure: ARM architectural registers]

When performing a binary operation, it is common for the value to overflow a register (i.e. become too large to fit in the register). ALUs have n-bit inputs and an n-bit output with a carry-out (i.e. overflow) flag.

Double-word addition cannot be performed in one instruction, but it requires relatively few instructions. For multiplication, however, you need to double the word size to fit the result, and the ALU only has n-bit inputs and n-bit outputs when you need 2n bits of output, so that alone won't work. For example, multiplying two 32-bit integers needs a 64-bit result, and two 64-bit integers need up to a 128-bit result spanning 4 word-sized registers; 2 is not bad, but 4 gets complicated and you run out of registers. How the CPU handles this varies. The Cortex-M0 has no instruction for it, but the Cortex-M3/M4 have an instruction for a 32x32 => 64-bit register multiply that takes 3 clock cycles.

(You can use Cortex-M0's 32x32 => 32-bit muls as a 16x16=>32-bit building block for larger multiplies; this is obviously inefficient but probably still better than manually shifting and conditionally adding.)
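
To make that decomposition concrete, here is a minimal C++ sketch (mine, not from the original answer) of the schoolbook method: a 64x64 => 128-bit multiply built from four 32x32 => 64-bit partial products, the same trick used at whatever the widest widening multiply happens to be. The struct and function names are illustrative.

    #include <cstdint>

    struct U128 { uint64_t hi, lo; };   // illustrative 128-bit result type

    // Schoolbook 64x64 => 128-bit unsigned multiply from 32x32 => 64-bit pieces.
    U128 mul64x64_128(uint64_t a, uint64_t b) {
        uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
        uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

        uint64_t p0 = a_lo * b_lo;      // contributes to bits 0..63
        uint64_t p1 = a_lo * b_hi;      // cross product, bits 32..95
        uint64_t p2 = a_hi * b_lo;      // cross product, bits 32..95
        uint64_t p3 = a_hi * b_hi;      // contributes to bits 64..127

        uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;  // cannot overflow 64 bits
        U128 r;
        r.lo = (mid << 32) | (uint32_t)p0;
        r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
        return r;
    }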

AVR Architecture

The AVR microcontroller has 131 instructions that work on 32 8-bit registers and is classified as an 8-bit processor by its register width, but it has both an 8-bit and a 16-bit ALU. The AVR processor cannot do 16x16 => 32-bit calculations with two 16-bit register pairs, or 64-bit integer math, without a software hack. This is the opposite of the x86/x64 design in both the organization of registers and the ALU overflow operation. This is why AVR is classified as an 8/16-bit CPU. Why do you care? It affects performance and interrupt behavior.

AVR "tiny", and other devices without the "enhanced" instruction-set don't have hardware multiply at all. But if supported at all, the mul instruction is 8x8 => 16-bit hardware multiply. https://godbolt.org/z/7bbqKn7Go shows how GCC uses it.

[Figure: AVR architectural registers]

x86 Architecture

On x86, multiplying two 32-bit integers to create a 64-bit integer can be done with the MUL instruction, producing an unsigned 64-bit result in EDX:EAX; on x86-64, the 64-bit MUL produces a 128-bit result in the RDX:RAX pair.

Adding 64-bit integers on x86 requires only two instructions (add/adc, thanks to the carry flag), and the same goes for 128-bit on x86-64. But multiplying two-register integers takes more work.
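
For instance, here is a minimal sketch (mine, not from the original answer) of 128-bit addition built from two 64-bit additions, which a compiler for x86-64 turns into exactly one add plus one adc:

    #include <cstdint>

    struct U128 { uint64_t hi, lo; };   // illustrative 128-bit value

    // Two-limb addition: an "add" for the low halves, then an "adc"-style
    // addition of the high halves plus the carry out of the low addition.
    U128 add128(U128 a, U128 b) {
        U128 r;
        r.lo = a.lo + b.lo;
        uint64_t carry = (r.lo < a.lo) ? 1 : 0;   // unsigned wraparound detects the carry
        r.hi = a.hi + b.hi + carry;
        return r;
    }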

On 32-bit x86, for example, 64x64 => 64-bit multiplication (long long) requires A LOT of instructions, including 3 multiplies (only the low x low one widening; the cross products don't need to widen, because we don't need the high 64 bits of the full result). Here is an example of x86 assembly for a 32x64 => 64-bit signed multiply (one operand is a sign-extended 32-bit value):

 movl 16(%ebp), %esi    ; get y_l
 movl 12(%ebp), %eax    ; get x_l
 movl %eax, %edx
 sarl $31, %edx         ; get x_h, (x >>a 31), higher 32 bits of sign-extension of x
 movl 20(%ebp), %ecx    ; get y_h
 imull %eax, %ecx       ; compute s: x_l*y_h
 movl %edx, %ebx
 imull %esi, %ebx       ; compute t: x_h*y_l
 addl %ebx, %ecx        ; compute s + t
 mull %esi              ; compute u: x_l*y_l
 leal (%ecx,%edx), %edx ; u_h += (s + t), result is u
 movl 8(%ebp), %ecx
 movl %eax, (%ecx)
 movl %edx, 4(%ecx)
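
For reference, this is roughly the kind of C++ source that compiles to the sequence above; the exact signature and argument order are my assumption for illustration:

    #include <cstdint>

    // One operand is a sign-extended 32-bit value, the other a 64-bit value;
    // on 32-bit x86 this compiles to the 3-multiply sequence shown above.
    void mul32x64(int64_t *out, int32_t x, int64_t y) {
        *out = (int64_t)x * y;
    }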

x86 supports pairing up two registers to store the full multiply result (including the high half), but you can't use the two registers to perform the task of a 64-bit ALU. This is the primary reason why x64 software runs faster than x86 software for 64-bit or wider integer math: you can do the work in a single instruction! You can imagine that 128-bit multiplication in x86 mode would be very computationally expensive, and it is. x64 is very similar to x86, except with twice the number of bits.

[Figure: x86 architectural registers]

[Figure: x64 architectural registers]

When a CPU pairs two word-sized registers to create a single double-word-sized value, the resulting double-word value on the stack is aligned to a word boundary in RAM. Beyond the two-register pair, four-word math is a software hack. This means that on x64, two 64-bit registers may be combined into a 128-bit register pair whose spill to RAM is aligned to a 64-bit word boundary, but 128x128 => 128-bit math is a software hack.

The x86/x64, however, is a superscalar CPU, and the registers you know of are merely the architectural registers. Behind the scenes there are many more registers that help optimize the CPU pipeline to execute instructions out of order using multiple ALUs.

SSE/SSE2 introduced 128-bit SIMD registers, but no instructions treat them as a single wide integer. There's paddq, which does two 64-bit additions in parallel, but there is no hardware support for 128-bit addition, nor even an easy way to propagate carry across elements manually. The widest multiply is two 32x32 => 64 operations in parallel, half the width of what you can do with x86-64 scalar mul. See Can long integer routines benefit from SSE? for the state of the art, and the hoops you have to jump through to get any benefit from SSE/AVX for very big integers.
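
As a small illustration (assuming SSE2 intrinsics from <emmintrin.h>), paddq adds the two 64-bit lanes independently, so any carry out of the low lane is simply lost; this is not a 128-bit addition:

    #include <emmintrin.h>   // SSE2 intrinsics

    // _mm_add_epi64 == paddq: two independent 64-bit additions,
    // with no carry propagation between the lanes.
    __m128i add_two_u64_lanes(__m128i a, __m128i b) {
        return _mm_add_epi64(a, b);
    }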

Even with AVX-512 (with 512-bit registers), the widest add / mul instructions still operate on 64-bit elements. AVX-512DQ did introduce a 64x64 => 64-bit multiply in SIMD elements (vpmullq), though.

Short Answer

The way that C++ applications handle 128-bit integers differs based on the operating system's or bare-metal calling convention. Microsoft has its own convention in which, much to my dismay, a 128-bit result CANNOT be returned from a function as a single value. The Microsoft x64 calling convention dictates that when returning a value, you may return one 64-bit integer or two 32-bit integers. For example, you can do word * word = dword, but in Visual C++ you must use _umul128 and receive the HighProduct through a pointer, even though the result is sitting in the RDX:RAX pair. I cried, it was sad. :-(
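
A minimal MSVC sketch of that pattern (assuming <intrin.h> on an x64 target; the wrapper function and variable names are mine):

    #include <cstdint>
    #include <intrin.h>   // MSVC intrinsics

    // _umul128 returns the low 64 bits and writes the high 64 bits through a
    // pointer, because the Microsoft x64 convention has no 128-bit return type.
    void full_mul(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi) {
        *lo = _umul128(a, b, hi);
    }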

The System V calling convention, however, does allow returning 128-bit return types in RDX:RAX. https://godbolt.org/z/vdd8rK38e. (And GCC / clang have __int128 to get the compiler to emit the necessary instructions for 2-register add/sub/mul, and a helper function for div - Is there a 128 bit integer in gcc?)
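
For example, with the GCC/clang extension the compiler does all the two-register work itself (a short sketch, compiled for x86-64):

    #include <cstdint>

    // unsigned __int128 is a GCC/clang extension; on x86-64 this compiles to a
    // single widening mul, and the result is returned in RDX:RAX under System V.
    unsigned __int128 mul_full(uint64_t a, uint64_t b) {
        return (unsigned __int128)a * b;
    }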

As for whether you should count on 128-bit integer support: it's extremely rare to come across a user running a 32-bit x86 CPU these days because they are too slow, so it is not best practice to design software to run on 32-bit x86 CPUs; doing so increases development costs and may lead to a degraded user experience. Expect an Athlon 64 or Core 2 Duo as the minimum spec. You can also expect the code to not perform as well on Microsoft operating systems as on Unix OSes.

The Intel architectural registers are set in stone, but Intel and AMD are constantly rolling out new architecture extensions; compilers and apps take a long time to catch up, so you can't count on those extensions being available cross-platform. You'll want to read the Intel 64 and IA-32 Architectures Software Developer's Manual and the AMD64 Architecture Programmer's Manual.
