Was there a P4 model with double-pumped 64-bit operations?

Posted: 2020/10/11 0:06:50 · Tags: x86, x86-64, intel, cpu-architecture

Question

I recall that one of the interesting features of the initial P4 micro-architecture was its double-pumped ALU. I think Intel called it something like the Rapid Execution Unit, but basically it meant that each execution unit in the ALU was effectively running at twice the frequency, and could handle two simple ALU operations in a single cycle, even if they were dependent.

This feature disappeared at some point (before or at the same time as the P4), but was there ever a 64-bit P4 with a double-pumped ALU? The 64-bit variants of the P4 came out in 2004, about four years after the initial 32-bit release, but it isn't clear to me whether the double-speed ALU had disappeared by then. It seems like the width-pipelined approach used to double the speed would be difficult for 64-bit, which is what piqued my curiosity.

Since one may still need to support some (evidently quite old) 64-bit P4 hardware, knowing the ALU behavior is interesting for optimization.

Answer

I found the Intel Optimization Manual 2005 that covers both 32-bit and 64-bit NetBurst processors. Refer to Table C-8 on page C-17. According to the first comment on this blog post, the 32-bit Northwood's model is 02h and the 64-bit Nocona's model is 03h. The table shows that ADD/SUB/AND/OR/XOR have a throughput of 0.5 cycles on both processors, but a latency of 0.5 cycles on Northwood and 1 cycle on Nocona. This means that double-pumping is supported on Nocona, but only if the back-to-back instructions are not dependent. The rest of the table also shows that some instructions that were not double-pumped on Northwood were double-pumped on Nocona.
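To make the latency/throughput distinction concrete, here is a minimal sketch that plugs the figures quoted above into a deliberately simplified model. The model itself is an assumption for illustration: a dependent chain is treated as latency-bound and independent operations as throughput-bound on a single ALU. The names `TABLE_C8_ADD` and `chain_cycles` are made up for this example.

```python
# Figures quoted in the answer from Table C-8, in core clock cycles.
# "throughput" is the reciprocal throughput: cycles between issues.
TABLE_C8_ADD = {
    "Northwood": {"latency": 0.5, "throughput": 0.5},
    "Nocona": {"latency": 1.0, "throughput": 0.5},
}


def chain_cycles(cpu, n_ops, dependent):
    """Rough cycle estimate for n_ops simple ALU ops on one ALU.

    Simplified model (an assumption, not Intel's): a dependent chain is
    bound by latency, independent ops by reciprocal throughput.
    """
    entry = TABLE_C8_ADD[cpu]
    per_op = entry["latency"] if dependent else entry["throughput"]
    return n_ops * per_op


for cpu in TABLE_C8_ADD:
    print(cpu,
          "dependent:", chain_cycles(cpu, 8, dependent=True),
          "independent:", chain_cycles(cpu, 8, dependent=False))
```

Under this model, eight dependent additions take about 4 cycles on Northwood but about 8 on Nocona, while eight independent additions take about 4 cycles on both.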

Summary: There is ample evidence that some NetBurst-based processors (whether released or canceled) could perform at least two 64-bit ALU operations per cycle, using either two 32-bit staggered ALUs or at least one 64-bit staggered ALU (which would have been enabled by the smaller feature sizes of the time, such as 90nm).

Figure 7 of the original paper¹ on the Intel Pentium 4 Willamette² processor discusses how the double-pumped³ ALU works in some detail (at the logic design level).

The figure shows a single 32-bit staggered ALU unit. This confirms that the ALU can perform two fully dependent (both input operands are dependent) simple ALU operations in three fast cycles (where a fast cycle is one half of the main clock cycle). The result of the operation itself is available after 2 fast cycles (1 main cycle), but the new flags are only available after the third fast cycle (1.5 main cycles). Note that there are two such ALUs, on ports 0 and 1, and both are staggered. So the design could execute two dependent ALU chains at a combined throughput of 4 operations per slow cycle.
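The three fast cycles can be sketched functionally. As I recall from the Hinton paper, the 32-bit operation is split into 16-bit halves: the low half in the first fast cycle, the high half in the second, and the flags in the third. The code below is only a behavioral illustration under that assumption; `staggered_add32` and its log format are hypothetical names, and nothing here models the real circuit.

```python
MASK16 = 0xFFFF


def staggered_add32(a, b):
    """Model a 32-bit add done as two 16-bit halves plus a flags step.

    Returns (result, flags, fast_cycle_log). A functional sketch only.
    """
    log = []

    # Fast cycle 1: low 16 bits. A dependent op could start its own
    # low half right after this step.
    low = (a & MASK16) + (b & MASK16)
    carry = low >> 16
    low &= MASK16
    log.append(("fast cycle 1", "low half", low))

    # Fast cycle 2: high 16 bits plus the carry out of the low half.
    high = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    carry_out = high >> 16
    high &= MASK16
    result = (high << 16) | low
    log.append(("fast cycle 2", "high half", high))

    # Fast cycle 3: flags, which need the full 32-bit result.
    flags = {"CF": carry_out, "ZF": int(result == 0), "SF": result >> 31}
    log.append(("fast cycle 3", "flags", flags))

    return result, flags, log


result, flags, log = staggered_add32(0x0000FFFF, 0x00000001)
print(hex(result), flags)
for step in log:
    print(step)
```

Because the low half of a result exists one fast cycle before the high half, a dependent operation can begin on the low half immediately, which is what allows back-to-back dependent operations at the fast clock rate.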

That paper was published in 2001. Intel published another paper⁴ in 2005 that discusses in great detail, at the circuit level, the staggered integer core in the Intel Pentium 4 Prescott⁵ processor. It's not clear to me whether the paper discusses the 64-bit version of Prescott or the 32-bit version. However, this paper clearly states that the staggered ALU units can only perform additions, Boolean operations, shifts, and rotations (the other paper discussed the design of pre-Prescott cores, in which the two fast ALU units did not support shifts and rotations). The other important difference is this statement from the paper:

There are two distinct 32-bit FCLK execution data paths staggered by one clock to implement 64-bit operations.

So it seems that the two fast ALU units on ports 0 and 1 are staggered together, enabling fast 64-bit integer operations such as additions. Therefore, the design could execute either two 32-bit dependent ALU chains at a combined throughput of 4 operations per slow cycle, or one 64-bit dependent ALU chain at a throughput of 2 operations per slow cycle. This is even more powerful than a single staggered 64-bit ALU that can do only 64-bit operations, not 32-bit ones. This is most probably the design used in the 64-bit variants of the NetBurst microarchitecture.
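The "2 operations per slow cycle" figure for a 64-bit dependent chain can be checked with a small scheduling sketch. It assumes, based only on my reading of the quote above, that each 32-bit datapath handles one slice of a 64-bit operation in one fast clock and that the upper slice runs one fast clock after the lower one. Flags are ignored, and `chain_schedule` and `ops_per_slow_cycle` are hypothetical names.

```python
def chain_schedule(n_ops, slices=2):
    """Fast-cycle schedule for n_ops dependent adds on a staggered ALU.

    Slice s of op i runs in fast cycle i + s (0-based): each slice only
    needs the same slice of the previous op and the carry from the
    slice below it. Illustrative model, not a circuit description.
    """
    return [[op + s for s in range(slices)] for op in range(n_ops)]


def ops_per_slow_cycle(n_ops, slices=2, fast_per_slow=2):
    """Average throughput of the chain, in ops per slow (main) cycle."""
    schedule = chain_schedule(n_ops, slices)
    total_fast_cycles = schedule[-1][-1] + 1  # last slice of last op
    return n_ops * fast_per_slow / total_fast_cycles


print(chain_schedule(3))
print(ops_per_slow_cycle(1000))
```

For a long chain the result approaches 2 operations per slow cycle, since a new dependent operation can start every fast clock and the stagger only adds a one-clock tail at the end.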

Another⁶ paper⁷ from Intel confirms that Intel was indeed able to design a double-pumped 64-bit ALU. I quote from the paper:

In this paper, we describe a single-cycle integer ALU fabricated in 90nm dual-Vt CMOS technology operating at 4GHz in the 64b mode, with a 32b mode latency of 7GHz (measured at 1.3V, 25°C).

The paper doesn't mention whether this design has actually been used in any particular processor. But considering that the paper was published in 2004, there is a good chance that all of the 64-bit NetBurst cores (whether released or canceled) used the design.

Intel has released many 64-bit NetBurst-based processors. For example, see this list for the server-grade processors. One of the cores is called Nocona. There is some experimental evidence that the design mentioned earlier (2 staggered 32-bit ALUs) was actually used in Nocona. Refer to these slides, used in a course on code optimization taught at CMU in 2008. The slides compare the performance of Nocona (64-bit NetBurst), Intel Core (also 64-bit), and AMD Opteron (also 64-bit, and apparently implementing the same 64-bit staggered ALU design). This is the code used in a loop:

x = x + d[i];

where all elements are 32-bit integers (unfortunately, 64-bit integers were not used).

On slide 35, you can see the 32-bit integer addition throughput achieved on Nocona and Opteron. Since each operation requires a load and Nocona only supports a single load per cycle, Nocona's performance maxed out at around 1 operation per cycle. Opteron, however, which supports two loads per cycle, was close to the theoretical maximum of 2 operations per cycle. This experiment of course does not take advantage of staggering, but only of the fact that there are two 32-bit simple ALUs.

However, later in the slides, SSE3 is used instead of scalar integer registers. The results for all three processors are shown on slide 44. With SSE3, there is only one 128-bit load per 4 elements. Nocona can perform a 64-bit load from the L1D per cycle (see the article cited below), while Core can perform a single 128-bit L1D load per cycle. However, Core has a feature called Advanced Digital Media Boost (ADMB) that enables it to perform four 32-bit additions per cycle. That same paper also mentions that pre-Core architectures supported only two 32-bit SSE3 ALU operations per cycle. But if there are two 32-bit staggered ALUs in Nocona, the low SSE3 throughput implies that an SSE3 operation makes use of only one of the staggered ALUs. ADMB can be implemented in two ways. One is to expand each ALU to 64 bits, keep them staggered, and use both ALUs to perform two 64-bit ALU operations per cycle. The other is to expand each ALU to 128 bits and eliminate staggering.
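The bottleneck reasoning for the SSE3 version of the loop can be written down as a simple bound. This sketch uses only the per-cycle figures stated above and assumes the loop is limited by whichever is smaller, L1D load bandwidth or ALU lanes; it does not reproduce the measurements on the slides, and `simd_add_bound` is a hypothetical helper name.

```python
def simd_add_bound(load_bits_per_cycle, alu_lanes_per_cycle, lane_bits=32):
    """Upper bound on 32-bit adds per cycle for x = x + d[i] with SIMD.

    Every element must be loaded once, so the loop cannot beat the L1D
    load bandwidth; it also cannot beat the number of ALU lanes.
    """
    load_bound = load_bits_per_cycle / lane_bits
    return min(load_bound, alu_lanes_per_cycle)


# Inputs are the per-cycle figures stated in the answer.
print("Nocona:", simd_add_bound(64, 2))   # 64-bit L1D load, 2 lanes
print("Core:", simd_add_bound(128, 4))    # 128-bit L1D load, 4 lanes
```

With these inputs the bound is 2 additions per cycle for Nocona and 4 for Core, and in both cases the load bandwidth and the ALU lane count happen to give the same limit.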

There is a patent, filed by Intel in 1998 and granted in 2001, on the staggered execution of an instruction: basically any instruction, not just ALU operations. That patent is still active. There is a lot of discussion there on how staggered execution can be useful for 128-bit SIMD instructions. Based on this patent, it's very possible that Intel Core uses two 64-bit staggered ALUs to achieve its throughput. Each of the 64-bit ALUs can actually be made using two of the staggered 32-bit ALUs shown in the figure above.

In 2002, Intel filed a patent for a generic staggered ALU design. It was generic in the sense that it was not about any specific ALU operation, number of clock cycles, or clock period. The interesting thing here is that one of the figures there shows a staggered 64-bit ALU design! That was in 2002. The patent also discusses some of the challenges in designing staggered ALUs.

The patent record says that it was both granted and abandoned on the same day in 2006. Then, a few months later, another identical patent application was filed.

This article shows that Potomac (another server-grade Pentium 4) is a 64-bit architecture and supports four 64-bit operations per cycle. Yamhill and Jayhawk were canceled by Intel. (There is an error in the article: Nocona is a 64-bit CPU.)

(1) In case the link goes down, the paper is titled "The Microarchitecture of the Pentium® 4 Processor" and authored by Glenn Hinton, et al.

(2) Also known as the first-gen Pentium 4.

(3) Also known as staggered ALU.

(4) In case the link goes down, the paper is titled "Low-Voltage Swing Logic Circuits for a Pentium® 4 Processor Integer Core" and authored by Daniel J. Deleganes, et al.

(5) Also known as the third-gen Pentium 4.

(6) In case the link goes down, the paper is titled "A 4GHz 300mW 64b Integer Execution ALU with Dual Supply Voltages in 90nm CMOS" and authored by Sanu K. Mathew, et al.

(7) In case the link goes down, the paper is titled "HIGH-PERFORMANCE ENERGY-EFFICIENT DUAL-SUPPLY ALU DESIGN" and authored by Sanu K. Mathew, et al.
