What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

Question

I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular architecture, taking into account instruction reordering, superscalarity, latencies, CPIs, etc.

What rules must be followed to achieve this?

I think I've got some preliminary rules figured out, but I haven't been able to find any references on breaking down any example code to this level of detail, so I've had to take some guesses. (For example, the Intel optimization manual barely even mentions instruction reordering.)

At minimum, I'm looking for (1) confirmation that each rule is correct or else a correct statement of each rule, and (2) a list of any rules that I may have forgotten.

    • As many instructions as possible are issued each cycle, starting in-order from the current cycle and potentially as far ahead as the reorder buffer size.
    • An instruction can be issued on a given cycle if:
      • No instructions that affect its operands are still being executed. And:
      • If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering). And:
      • There is a functional unit available for that instruction on that cycle. Every (?) functional unit is pipelined, meaning it can accept 1 new instruction per cycle, and the number of total functional units is 1/CPI, for the CPI of a given function class (nebulous here: presumably e.g. addps and subps use the same functional unit? How do I determine this?). And:
      • Fewer than the superscalar width (typically 4) number of instructions have already been issued this cycle.

      As an example, consider the following code (which computes a cross product):

      shufps   xmm3, xmm2, 210
      shufps   xmm0, xmm1, 201
      shufps   xmm2, xmm2, 201
      mulps    xmm0, xmm3
      shufps   xmm1, xmm1, 210
      mulps    xmm1, xmm2
      subps    xmm0, xmm1
      

      My attempt to predict the latency for Haswell looks something like this:

      ; `mulps`  Haswell latency=5, CPI=0.5
      ; `shufps` Haswell latency=1, CPI=1
      ; `subps`  Haswell latency=3, CPI=1
      
      shufps   xmm3, xmm2, 210   ; cycle  1
      shufps   xmm0, xmm1, 201   ; cycle  2
      shufps   xmm2, xmm2, 201   ; cycle  3
      mulps    xmm0, xmm3        ;   (superscalar execution)
      shufps   xmm1, xmm1, 210   ; cycle  4
      mulps    xmm1, xmm2        ; cycle  5
                                 ; cycle  6 (stall `xmm0` and `xmm1`)
                                 ; cycle  7 (stall `xmm1`)
                                 ; cycle  8 (stall `xmm1`)
      subps    xmm0, xmm1        ; cycle  9
                                 ; cycle 10 (stall `xmm0`)
      

      Answer

      TL:DR: look for dependency chains, especially loop-carried ones. For a long-running loop, see which of latency, front-end throughput, or back-end port contention/throughput is the worst bottleneck. That's how many cycles your loop probably takes per iteration, on average, if there are no cache misses or branch mispredicts.

      Related: How many CPU cycles are needed for each assembly instruction? is a good introduction to throughput vs. latency on a per-instruction basis, and what that means for sequences of multiple instructions.

      This is called static (performance) analysis. Wikipedia says (https://en.wikipedia.org/wiki/List_of_performance_analysis_tools) that AMD's AMD CodeXL has a "static kernel analyzer" (i.e. for computational kernels, aka loops). I've never tried it.

      Intel also has a free tool for analyzing how loops will go through the pipeline in Sandybridge-family CPUs: What is IACA and how do I use it?

      IACA is not bad, but has bugs (e.g. wrong data for shld on Sandybridge, and last I checked, it doesn't know that Haswell/Skylake can keep indexed addressing modes micro-fused for some instructions. But maybe that will change now that Intel's added details on that to their optimization manual.) IACA is also unhelpful for counting front-end uops to see how close to a bottleneck you are (it likes to only give you unfused-domain uop counts).

      Static analysis is often pretty good, but definitely check by profiling with performance counters. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of profiling a simple loop to investigate a microarchitectural feature.

      Agner Fog's microarch guide (chapter 2: Out of order exec) explains some of the basics of dependency chains and out-of-order execution. His "Optimizing Assembly" guide has more good introductory and advanced performance stuff.

      The later chapters of his microarch guide cover the details of the pipelines in CPUs like Nehalem, Sandybridge, Haswell, K8/K10, Bulldozer, and Ryzen. (And Atom / Silvermont / Jaguar).

      Agner Fog's instruction tables (spreadsheet or PDF) are also normally the best source for instruction latency / throughput / execution-port breakdowns.

      David Kanter's microarch analysis docs are very good, with diagrams. e.g. https://www.realworldtech.com/sandy-bridge/, https://www.realworldtech.com/haswell-cpu/, and https://www.realworldtech.com/bulldozer/.

      See also other performance links in the x86 tag wiki.

      I also took a stab at explaining how a CPU core finds and exploits instruction-level parallelism in this answer, but I think you've already grasped those basics as far as it's relevant for tuning software. I did mention how SMT (Hyperthreading) works as a way to expose more ILP to a single CPU core, though.

      In Intel terminology:

      • issue"表示向核心的乱序部分发送一个uop;连同寄存器重命名,这是前端的最后一步.问题/重命名阶段通常是管道中最窄的点,例如自 Core2 以来,在 Intel 上为 4 宽.(由于 SKL 改进的解码器和 uop 缓存带宽,以及后端和缓存带宽的改进,在一些真实代码中,像 Haswell 尤其是 Skylake 这样的 uarch 经常实际上非常接近.)这是融合域 uop: micro-fusion 可以让你通过前端发送 2 uop 并且只占用一个 ROB 条目.(我能够在 Skylake 上构建一个循环,维持 7 个未融合的-每个时钟域 uops).另见 http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ 回复:乱序窗口大小.

      • "issue" means to send a uop into the out-of-order part of the core; along with register-renaming, this is the last step in the front-end. The issue/rename stage is often the narrowest point in the pipeline, e.g. 4-wide on Intel since Core2. (With later uarches like Haswell and especially Skylake often actually coming very close to that in some real code, thanks to SKL's improved decoders and uop-cache bandwidth, as well as back-end and cache bandwidth improvements.) This is fused-domain uops: micro-fusion lets you send 2 uops through the front-end and only take up one ROB entry. (I was able to construct a loop on Skylake that sustains 7 unfused-domain uops per clock). See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ re: out-of-order window size.

      dispatch" 表示调度程序向执行端口发送一个 uop.一旦所有输入准备就绪,并且相关的执行端口可用,就会发生这种情况.x86 uops 究竟是如何调度的?.调度发生在未融合"的环境中.领域;微融合 uops 在 OoO 调度器(又名预订站,RS)中单独跟踪.

      "dispatch" means the scheduler sends a uop to an execution port. This happens as soon as all the inputs are ready, and the relevant execution port is available. How are x86 uops scheduled, exactly?. Scheduling happens in the "unfused" domain; micro-fused uops are tracked separately in the OoO scheduler (aka Reservation Station, RS).

      A lot of other computer-architecture literature uses these terms in the opposite sense, but this is the terminology you will find in Intel's optimization manual, and the names of hardware performance counters like uops_issued.any or uops_dispatched_port.port_5.
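
      As a concrete illustration of the fused vs. unfused distinction (standard micro-fusion behavior on SnB-family, not taken from the code in the question):

      add      eax, [rdi]     ; 1 fused-domain uop at issue/rename (one ROB entry),
                              ; but 2 unfused-domain uops at dispatch: a load on
                              ; p2/p3 plus an ALU add on p0/p1/p5/p6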

      exactly how long arbitrary arithmetical x86-64 assembly code will take

      It depends on the surrounding code as well, because of OoO exec

      Your final subps result doesn't have to be ready before the CPU starts running later instructions. Latency only matters for later instructions that need that value as an input, not for integer looping and whatnot.

      Sometimes throughput is what matters, and out-of-order exec can hide the latency of multiple independent short dependency chains. (e.g. if you're doing the same thing to every element of a big array of multiple vectors, multiple cross products can be in flight at once.) You'll end up with multiple iterations in flight at once, even though in program order you finish all of one iteration before doing any of the next. (Software pipelining can help for high-latency loop bodies if OoO exec has a hard time doing all the reordering in HW.)
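
      A minimal sketch (register allocation invented for illustration): the second chain below shares no registers with the first, so out-of-order exec can run it inside the first chain's mulps latency shadow:

      mulps    xmm0, xmm3          ; chain A (5c latency on Haswell)
      mulps    xmm4, xmm7          ; chain B: independent, overlaps chain A
      subps    xmm0, xmm1          ; chain A continues
      subps    xmm4, xmm5          ; chain B continues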

      You can approximately characterize a short block of non-branching code in terms of these three factors. Usually only one of them is the bottleneck for a given use-case. Often you're looking at a block that you will use as part of a loop, not as the whole loop body, but OoO exec normally works well enough that you can just add up these numbers for a couple different blocks, if they're not so long that OoO window size prevents finding all the ILP.

      • latency from each input to the output(s). Look at which instructions are on the dependency chain from each input to each output. e.g. one choice might need one input to be ready sooner.
      • total uop count (for front-end throughput bottlenecks), fused-domain on Intel CPUs. e.g. Core2 and later can in theory issue/rename 4 fused-domain uops per clock into the out-of-order scheduler/ROB. Sandybridge-family can often achieve that in practice with the uop cache and loop buffer, especially Skylake with its improved decoders and uop-cache throughput.
      • uop count for each back-end execution port (unfused domain). e.g. shuffle-heavy code will often bottleneck on port 5 on Intel CPUs. Intel usually only publishes throughput numbers, not port breakdowns, which is why you have to look at Agner Fog's tables (or IACA output) to do anything meaningful if you're not just repeating the same instruction a zillion times.
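
      As a rough worked sketch (my annotation, combining the Haswell numbers quoted in the question with the port assignments discussed later in this answer), the cross-product block factors like this:

      ; 1) latency: critical path = shufps (1c) -> mulps (5c) -> subps (3c)
      ;    ~9 cycles from inputs ready until the subps result is ready
      ; 2) front-end: 7 single-uop instructions = 7 fused-domain uops,
      ;    ~2 cycles to issue at 4 per clock -- not the bottleneck
      ; 3) ports: 4x shufps (p5), 2x mulps (p0/p1), 1x subps (p1)
      ;    -> 4 cycles of p5 throughput: the bottleneck if this block repeats
      ;       with independent inputs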

      Generally you can assume best-case scheduling/distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen sometimes. (How are x86 uops scheduled, exactly?)

      Looking at CPI is not sufficient; two CPI=1 instructions might or might not compete for the same execution port. If they don't, they can execute in parallel. e.g. Haswell can only run psadbw on port 0 (5c latency, 1c throughput, i.e. CPI=1) but it's a single uop so a mix of 1 psadbw + 3 add instructions could sustain 4 instructions per clock. There are vector ALUs on 3 different ports in Intel CPUs, with some operations replicated on all 3 (e.g. booleans) and some only on one port (e.g. shifts before Skylake).
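
      A hypothetical straight-line mix showing that (port assignments as just described; best-case scheduling assumed): repeating this group could sustain 4 instructions per clock on Haswell, because the adds can use ports other than the p0 that psadbw needs:

      psadbw   xmm0, xmm1          ; 1 uop, port 0 only (CPI=1)
      add      eax, 1              ; integer ALU uops can go to p1/p5/p6 instead
      add      ebx, 1
      add      ecx, 1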

      Sometimes you can come up with a couple different strategies, one maybe lower latency but costing more uops. A classic example is multiplying by constants like imul eax, ecx, 10 (1 uop, 3c latency on Intel) vs. lea eax, [rcx + rcx*4] / add eax,eax (2 uops, 2c latency). Modern compilers tend to choose 2 LEA vs. 1 IMUL, although clang up to 3.7 favoured IMUL unless it could get the job done with only a single other instruction.
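
      Side by side, with the exact instructions from that example:

      ; fewer uops, higher latency:
      imul     eax, ecx, 10        ; 1 uop, 3c latency on Intel

      ; more uops, lower latency:
      lea      eax, [rcx + rcx*4]  ; eax = ecx*5   (1 uop, 1c)
      add      eax, eax            ; eax = ecx*10  (1 uop, 1c; 2 uops, 2c total)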

      See What is the efficient way to count set bits at a position or lower? for an example of static analysis for a few different ways to implement a function.

      See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (which ended up being way more detailed than you'd guess from the question title) for another summary of static analysis, and some neat stuff about unrolling with multiple accumulators for a reduction.
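
      A minimal sketch of that multiple-accumulator idea (hypothetical loop; rsi = input pointer, rdi = end pointer, names invented for illustration). With one accumulator, the loop-carried chain is addps's 3c latency on Haswell; a second independent accumulator lets two additions be in flight at once:

      xorps    xmm0, xmm0          ; accumulator 0
      xorps    xmm1, xmm1          ; accumulator 1: an independent dep chain
      .loop:
      addps    xmm0, [rsi]         ; loop-carried chain 0 (3c addps latency)
      addps    xmm1, [rsi+16]      ; chain 1 overlaps chain 0's latency
      add      rsi, 32
      cmp      rsi, rdi
      jb       .loop
      addps    xmm0, xmm1          ; combine the accumulators once at the end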

      Every (?) functional unit is pipelined

      The divider is pipelined in recent CPUs, but not fully pipelined. (FP divide is single-uop, though, so if you do one divps mixed in with dozens of mulps / addps, it can have negligible throughput impact if latency doesn't matter: Floating point division vs floating point multiplication. rcpps + a Newton iteration is worse throughput and about the same latency.)

      Everything else is fully pipelined on mainstream Intel CPUs; multi-cycle (reciprocal) throughput for a single uop. (Variable-count integer shifts like shl eax, cl have lower-than-expected throughput for their 3 uops, because they create a dependency through the flag-merging uops. But if you break that dependency through FLAGS with an add or something, you can get better throughput and latency; see the sketch below.)
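
      A hedged sketch of that FLAGS idea (my illustration, not from the answer itself). The flag-merge uops make shl reg, cl depend on the previous FLAGS value, so a cheap flag-writing instruction whose input is already ready can feed the merge instead of a long dependency chain:

      shl      eax, cl             ; 3 uops on SnB-family (flag-merging included)
      test     ecx, ecx            ; writes FLAGS from a ready register (assumed to
                                   ; break the merge dependency, per "break that
                                   ; dependency through FLAGS with an add or something")
      shl      ebx, cl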

      On AMD before Ryzen, the integer multiplier is also only partially pipelined. e.g. Bulldozer's imul ecx, edx is only 1 uop, but with 4c latency, 2c throughput.

      Xeon Phi (KNL) also has some not-fully-pipelined shuffle instructions, but it tends to bottleneck on the front-end (instruction decode), not the back-end, and does have a small buffer + OoO exec capability to hide back-end bubbles.

      If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering)

      No.

      Maybe you read that for Silvermont, which doesn't do OoO exec for FP/SIMD, only integer (with a small ~20 uop window). Maybe some ARM chips are like that, too, with simpler schedulers for NEON? I don't know much about ARM uarch details.

      The mainstream big-core microarchitectures like P6 / SnB-family, and all AMD OoO chips, do OoO exec for SIMD and FP instructions the same as for integer. AMD CPUs use a separate scheduler, but Intel uses a unified scheduler so its full size can be applied to finding ILP in integer or FP code, whichever is currently running.

      Even the Silvermont-based Knights Landing (in Xeon Phi) does OoO exec for SIMD.

      x86 is generally not very sensitive to instruction ordering, but uop scheduling doesn't do critical-path analysis. So it could sometimes help to put instructions on the critical path first, so they aren't stuck waiting with their inputs ready while other instructions run on that port, leading to a bigger stall later when we get to instructions that need the result of the critical path. (i.e. that's why it is the critical path.)

      My attempt to predict the latency for Haswell looks something like this:

      Yup, that looks right. shufps runs on port 5, addps runs on p1, mulps runs on p0 or p1. Skylake drops the dedicated FP-add unit and runs SIMD FP add/mul/FMA on the FMA units on p0/p1, all with 4c latency (up/down from 3/5/5 in Haswell, or 3/3/5 in Broadwell).

      This is a good example of why keeping a whole XYZ direction vector in a SIMD vector usually sucks. Keeping an array of X, an array of Y, and an array of Z, would let you do 4 cross products in parallel without any shuffles.
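
      A sketch of what that buys you (hypothetical SoA layout and pointer names: rsi points at a's component arrays, rdi at b's, rdx at the output). Each instruction handles one component of four cross products, with zero shuffles:

      movaps   xmm0, [rsi]         ; a.y[0..3]  (hypothetical layout)
      movaps   xmm1, [rsi+16]      ; a.z[0..3]
      movaps   xmm2, [rdi]         ; b.y[0..3]
      movaps   xmm3, [rdi+16]      ; b.z[0..3]
      movaps   xmm4, xmm0
      mulps    xmm4, xmm3          ; a.y * b.z
      movaps   xmm5, xmm1
      mulps    xmm5, xmm2          ; a.z * b.y
      subps    xmm4, xmm5          ; c.x = a.y*b.z - a.z*b.y for all 4 products
      movaps   [rdx], xmm4         ; c.y and c.z follow the same pattern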

      The SSE tag wiki has a link to these slides: SIMD at Insomniac Games (GDC 2015), which covers the array-of-structs vs. struct-of-arrays issue for 3D vectors, and why it's often a mistake to always try to SIMD a single operation instead of using SIMD to do multiple operations in parallel.
