How do Intel CPUs that use the ring bus topology decode and handle port I/O operations


Question


I understand port I/O at the hardware abstraction level (i.e. the CPU asserts a pin that indicates to devices on the bus that the address is a port address, which makes sense on earlier CPUs with a simple address bus model), but I'm not really sure how it's implemented microarchitecturally on modern CPUs, and in particular how the port I/O operation appears on the ring bus.


Firstly, where does the IN/OUT instruction get allocated to: the reservation station or the load/store buffer? My initial thoughts were that it would be allocated in the load/store buffer and that the memory scheduler would recognise it and send it to the L1d, indicating that it is a port-mapped operation. A line fill buffer is allocated and it gets sent to L2 and then to the ring. I'm guessing that the message on the ring has some port-mapped indicator which only the system agent accepts, and that it then checks its internal components and relays the port-mapped request to them; i.e. the PCIe root bridge would pick up CF8h and CFCh. I'm guessing the DMI controller is fixed to pick up all the standardised ports that will appear on the PCH, such as the one for the legacy DMA controller.

Answer


The execution of the IN and OUT instructions depends on the operating mode of the processor. In real mode, no permissions need to be checked to execute the instructions. In all other modes, the IOPL field of the Flags register and the I/O permission bitmap associated with the current hardware task need to be checked to determine whether the IN/OUT instruction is allowed to execute. In addition, the IN and OUT instructions have serialization properties that are stronger than those of LFENCE but weaker than those of a fully serializing instruction. According to Section 8.2.5 of the Intel manual volume 3:
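
To make that check concrete, here is a minimal C-style restatement of the architectural rule (not of any actual microcode); the struct fields and function names are illustrative only. It assumes the documented protected-mode behaviour: CPL <= IOPL grants access outright, otherwise every byte of the access must have a clear bit in the TSS I/O permission bitmap, and bits beyond the bitmap's limit are treated as deny (#GP).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Architectural inputs to the check (illustrative names, not a real API). */
struct cpu_state {
    unsigned cpl;              /* current privilege level, 0..3            */
    unsigned iopl;             /* IOPL field of (E/R)FLAGS                 */
    bool     v8086;            /* virtual-8086 mode?                       */
    const uint8_t *io_bitmap;  /* TSS I/O permission bitmap, or NULL       */
    uint32_t io_bitmap_bits;   /* number of valid bits in the bitmap       */
};

/* Rough sketch of the architectural check for an I/O access of `width`
 * bytes at `port`.  Real mode skips this check entirely. */
static bool io_access_allowed(const struct cpu_state *s,
                              uint16_t port, unsigned width)
{
    /* In protected mode, CPL <= IOPL grants access outright
     * (virtual-8086 mode always consults the bitmap instead). */
    if (!s->v8086 && s->cpl <= s->iopl)
        return true;

    /* Otherwise every byte of the access must have a clear bit in the
     * TSS I/O permission bitmap; bits beyond the TSS limit count as set
     * (deny), which raises #GP on real hardware. */
    for (unsigned i = 0; i < width; i++) {
        uint32_t bit = (uint32_t)port + i;
        if (s->io_bitmap == NULL || bit >= s->io_bitmap_bits)
            return false;
        if (s->io_bitmap[bit / 8] & (1u << (bit % 8)))
            return false;
    }
    return true;
}

int main(void)
{
    static uint8_t bitmap[8192] = { 0 };       /* all ports allowed...     */
    bitmap[0x80 / 8] |= 1u << (0x80 % 8);      /* ...except port 0x80      */
    struct cpu_state s = { .cpl = 3, .iopl = 0, .v8086 = false,
                           .io_bitmap = bitmap, .io_bitmap_bits = 65536 };
    printf("port 0x70 allowed: %d, port 0x80 allowed: %d\n",
           io_access_allowed(&s, 0x70, 1), io_access_allowed(&s, 0x80, 1));
    return 0;
}
```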


Memory mapped devices and other I/O devices on the bus are often sensitive to the order of writes to their I/O buffers. I/O instructions can be used (the IN and OUT instructions) to impose strong write ordering on such accesses as follows. Prior to executing an I/O instruction, the processor waits for all previous instructions in the program to complete and for all buffered writes to drain to memory. Only instruction fetch and page tables walks can pass I/O instructions. Execution of subsequent instructions do not begin until the processor determines that the I/O instruction has been completed.


This description suggests that an IN/OUT instruction completely blocks the allocation stage of the pipeline until all previous instructions have executed and the store buffer and WCBs have been drained, and only then does the IN/OUT instruction retire. To implement these serialization properties and to perform the necessary operating-mode and permission checks, the IN/OUT instruction needs to be decoded into many uops. For more information on how such an instruction can be implemented, refer to: What happens to software interrupts in the pipeline?


Older versions of the Intel optimization manual did provide latency and throughput numbers for the IN and OUT instructions. All of them seem to say that the worst-case latency is 225 cycles and that the throughput is exactly 40 cycles per instruction. However, these numbers don't make much sense to me because I think the latency depends on the I/O device being read from or written to. And because these instructions are basically serialized, the latency essentially determines the throughput.


I've tested the in al, 80h instruction on Haswell. According to @MargaretBloom, it's safe to read a byte from port 0x80 (which, according to osdev.org, is mapped to some DMA controller register). Here is what I found (a rough user-space measurement sketch follows the list below):

  • The instruction is counted as a single load uop by MEM_UOPS_RETIRED.ALL_LOADS. It's also counted as a load uop that misses the L1D. However, it's not counted as a load uop that hits the L1D, or as one that hits or misses the L2 or L3 caches.
  • The distribution of uops is as follows: p0:16.4, p1:20, p2:1.2, p3:2.9, p4:0.07, p5:16.2, p6:42.8, and finally p7:0.04. That's a total of 99.6 uops per in al, 80h instruction.
  • The throughput of in al, 80h is 3478 cycles per instruction. I think the throughput depends on the I/O device, though.
  • According to L1D_PEND_MISS.PENDING_CYCLES, the I/O load request seems to be allocated in an LFB for one cycle.
  • When I add an IMUL instruction that depends on the result of the in instruction, the total execution time does not change. This suggests that the in instruction does not completely block the allocation stage until all of its uops have retired, and that it may overlap with later instructions, in contrast to my interpretation of the manual.
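
For reference, a throughput test of this kind can be reproduced from user space on Linux with ioperm() and the TSC. The sketch below is an assumed reconstruction of such a loop, not the exact harness used above; note that rdtsc counts reference cycles, so its numbers will not match core-cycle performance counters exactly.

```c
/* Rough sketch of a user-space throughput test for `in al, 80h` on Linux.
 * Build with: gcc -O2 io_in_bench.c -o io_in_bench   (run as root, x86 only) */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/io.h>      /* ioperm(), inb() */
#include <x86intrin.h>   /* __rdtsc() */

#define ITERS 100000L

int main(void)
{
    /* Ask the kernel for access to port 0x80 (requires root). */
    if (ioperm(0x80, 1, 1) != 0) {
        perror("ioperm");
        return EXIT_FAILURE;
    }

    volatile uint8_t sink = 0;
    uint64_t start = __rdtsc();
    for (long i = 0; i < ITERS; i++)
        sink ^= inb(0x80);          /* the `in al, 80h` under test */
    uint64_t cycles = __rdtsc() - start;

    /* The loop overhead is negligible next to the I/O latency itself,
     * so cycles/ITERS approximates the per-instruction cost in TSC ticks. */
    printf("~%.1f TSC cycles per in al, 80h (sink=%u)\n",
           (double)cycles / ITERS, (unsigned)sink);
    return 0;
}
```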


I've tested the out dx, al instruction on Haswell for ports 0x3FF, 0x2FF, 0x3EF, and 0x2EF. The distribution of uops is as follows: p0:10.9, p1:15.2, p2:1, p3:1, p4:1, p5:11.3, p6:25.3, and finally p7:1. That's a total of 66.7 uops per instruction. The throughput of out to 0x2FF, 0x3EF, and 0x2EF is 1880 cycles; the throughput of out to 0x3FF is 6644.7 cycles. The out instruction is not counted as a retired store.
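
The out side can be exercised with the same kind of harness; only the inner loop changes. The fragment below is a sketch that reuses the headers from the in example and assumes ioperm() has already granted access to the chosen port; writing to arbitrary ports is riskier than reading, so the port must be chosen carefully.

```c
/* Variant of the timing loop above for `out dx, al`; assumes ioperm(port, 1, 1)
 * has already succeeded for `port`.  Note that glibc's outb() takes the value
 * first and the port second. */
static inline uint64_t tsc_per_out(unsigned short port, long iters)
{
    uint64_t start = __rdtsc();
    for (long i = 0; i < iters; i++)
        outb(0x00, port);            /* the `out dx, al` under test */
    return (__rdtsc() - start) / (uint64_t)iters;
}
```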


Once the I/O load or store request reaches the system agent, it can determine what to do with the request by consulting its system I/O mapping table. This table depends on the chipset. Some I/O ports are mapped statically while others are mapped dynamically. See, for example, Section 4.2 of the Intel 100 Series Chipset datasheet, which is used with Skylake processors. Once the request is completed, the system agent sends a response back to the processor so that it can fully retire the I/O instruction.
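
As a concrete example of a port access that the system agent routes to the PCI(e) root complex rather than down to the PCH, the legacy PCI configuration mechanism uses the CF8h/CFCh ports mentioned in the question. A user-space sketch on Linux (assuming ioperm() access and a kernel recent enough to grant ports above 0x3FF; older kernels need iopl(3) for that range) might look like this:

```c
/* Sketch: read the Vendor/Device ID of PCI device 0:0.0 through the legacy
 * 0xCF8/0xCFC configuration mechanism (Linux, x86, run as root). */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/io.h>      /* ioperm(), outl(), inl() */

#define PCI_CONFIG_ADDRESS 0xCF8
#define PCI_CONFIG_DATA    0xCFC

static uint32_t pci_config_read32(unsigned bus, unsigned dev,
                                  unsigned func, unsigned offset)
{
    uint32_t addr = (1u << 31)          /* enable bit                   */
                  | (bus  << 16)
                  | (dev  << 11)
                  | (func << 8)
                  | (offset & 0xFCu);   /* dword-aligned register offset */
    outl(addr, PCI_CONFIG_ADDRESS);     /* out to 0xCF8: select register */
    return inl(PCI_CONFIG_DATA);        /* in from 0xCFC: read the data  */
}

int main(void)
{
    if (ioperm(PCI_CONFIG_ADDRESS, 8, 1) != 0) {   /* ports 0xCF8..0xCFF */
        perror("ioperm");
        return EXIT_FAILURE;
    }
    uint32_t id = pci_config_read32(0, 0, 0, 0x00);
    printf("bus 0 dev 0 func 0: vendor=%04x device=%04x\n",
           id & 0xFFFF, id >> 16);
    return 0;
}
```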

