FLOPS per cycle for Sandy Bridge and Haswell SSE2/AVX/AVX2


Question


I'm confused about how many FLOPs per cycle per core can be done with Sandy Bridge and Haswell. As I understand it, it should be 4 FLOPs per cycle per core for SSE and 8 FLOPs per cycle per core for AVX/AVX2.

This seems to be verified here, How do I achieve the theoretical maximum of 4 FLOPs per cycle?, and here, Sandy-Bridge CPU specification.

However, the link below seems to indicate that Sandy Bridge can do 16 FLOPs per cycle per core and Haswell 32 FLOPs per cycle per core: http://www.extremetech.com/computing/136219-intels-haswell-is-an-unprecedented-threat-to-nvidia-amd.

Can someone explain this to me?

Edit: I understand now why I was confused. I thought the term FLOP only referred to single-precision floating point (SP). I see now that the tests at How do I achieve the theoretical maximum of 4 FLOPs per cycle? are actually in double-precision floating point (DP), so they achieve 4 DP FLOPs/cycle for SSE and 8 DP FLOPs/cycle for AVX. It would be interesting to redo these tests in SP.

Solution

Here are the theoretical maximum FLOP counts (per core) for a number of recent processor microarchitectures, along with an explanation of how to achieve them.

In general, to calculate this look up the throughput of the FMA instruction(s) e.g. on https://agner.org/optimize/ or any other microbenchmark result, and multiply
(FMAs per clock) * (vector elements / instruction) * 2 (FLOPs / FMA).
Note that achieving this in real code requires very careful tuning (like loop unrolling), and near-zero cache misses, and no bottlenecks on anything else. Modern CPUs have such high FMA throughput that there isn't much room for other instructions to store the results, or to feed them with input. e.g. 2 SIMD loads per clock is also the limit for most x86 CPUs, so a dot product will bottleneck on 2 loads per 1 FMA. A carefully-tuned dense matrix multiply can come close to achieving these numbers, though.

If your workload includes any ADD/SUB or MUL that can't be contracted into FMAs, the theoretical max numbers aren't an appropriate goal for your workload. Haswell/Broadwell have 2-per-clock SIMD FP multiply (on the FMA units), but only 1-per-clock SIMD FP add (on a separate vector FP add unit with lower latency). Skylake dropped the separate SIMD FP adder, running add/mul/fma identically at 4-cycle latency, 2-per-clock throughput, for any vector width.

Intel

Note that Celeron/Pentium versions of recent microarchitectures don't support AVX or FMA instructions, only SSE4.2.

Intel Core 2 and Nehalem (SSE/SSE2):

  • 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
  • 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication

Intel Sandy Bridge/Ivy Bridge (AVX1):

  • 8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
  • 16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication

Intel Haswell/Broadwell/Skylake/Kaby Lake/Coffee/... (AVX+FMA3):

  • 16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
  • 32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
  • (Using 256-bit vector instructions can reduce max turbo clock speed on some CPUs.)

Intel Skylake-X/Skylake-EP/Cascade Lake/etc (AVX512F) with 1 FMA unit: some Xeon Bronze/Silver

  • 16 DP FLOPs/cycle: one 8-wide FMA (fused multiply-add) instruction
  • 32 SP FLOPs/cycle: one 16-wide FMA (fused multiply-add) instruction
  • Same computation throughput as with narrower 256-bit instructions, but AVX512 can still give speedups from wider loads/stores, a few vector operations that don't run on the FMA units (like bitwise operations), and wider shuffles.
  • (Having 512-bit vector instructions in flight shuts down the vector ALU on port 1. Also reduces the max turbo clock speed, so "cycles" isn't a constant in your performance calculations.)

Intel Skylake-X/Skylake-EP/Cascade Lake/etc (AVX512F) with 2 FMA units: Xeon Gold/Platinum, and i7/i9 high-end desktop (HEDT) chips.

  • 32 DP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
  • 64 SP FLOPs/cycle: two 16-wide FMA (fused multiply-add) instructions
  • (Having 512-bit vector instructions in flight shuts down the vector ALU on port 1. Also reduces the max turbo clock speed.)

Future: Intel Cooper Lake (successor to Cascade Lake) is expected to introduce bfloat16 ("Brain Float"), a 16-bit float format for neural-network workloads, with support for actual SIMD computation on it, unlike the current F16C extension that only supports load/store with conversion to float32. This should double the FLOP/cycle throughput vs. single-precision on the same hardware.

Current Intel chips only have actual computation directly on standard float16 in the iGPU.


AMD

AMD K10:

  • 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
  • 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication

AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores):

  • 8 DP FLOPs/cycle: 4-wide FMA
  • 16 SP FLOPs/cycle: 8-wide FMA

AMD Ryzen:

  • 8 DP FLOPs/cycle: 4-wide FMA
  • 16 SP FLOPs/cycle: 8-wide FMA

x86 low power

Intel Atom (Bonnell/45nm, Saltwell/32nm, Silvermont/22nm):

  • 1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
  • 6 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication every other cycle

AMD Bobcat:

  • 1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
  • 4 SP FLOPs/cycle: 4-wide SSE addition every other cycle + 4-wide SSE multiplication every other cycle

AMD Jaguar:

  • 3 DP FLOPs/cycle: 4-wide AVX addition every other cycle + 4-wide AVX multiplication every fourth cycle
  • 8 SP FLOPs/cycle: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle


ARM

ARM Cortex-A9:

  • 1.5 DP FLOPs/cycle: scalar addition + scalar multiplication every other cycle
  • 4 SP FLOPs/cycle: 4-wide NEON addition every other cycle + 4-wide NEON multiplication every other cycle

ARM Cortex-A15:

  • 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
  • 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add

Qualcomm Krait:

  • 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
  • 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add

IBM POWER

IBM PowerPC A2 (Blue Gene/Q), per core:

  • 8 DP FLOPs/cycle: 4-wide QPX FMA every cycle
  • SP elements are extended to DP and processed on the same units

IBM PowerPC A2 (Blue Gene/Q), per thread:

  • 4 DP FLOPs/cycle: 4-wide QPX FMA every other cycle
  • SP elements are extended to DP and processed on the same units

Intel MIC / Xeon Phi

Intel Xeon Phi (Knights Corner), per core:

  • 16 DP FLOPs/cycle: 8-wide FMA every cycle
  • 32 SP FLOPs/cycle: 16-wide FMA every cycle

Intel Xeon Phi (Knights Corner), per thread:

  • 8 DP FLOPs/cycle: 8-wide FMA every other cycle
  • 16 SP FLOPs/cycle: 16-wide FMA every other cycle

Intel Xeon Phi (Knights Landing), per core:

  • 32 DP FLOPs/cycle: two 8-wide FMA every cycle
  • 64 SP FLOPs/cycle: two 16-wide FMA every cycle

Per-thread and per-core numbers are given for IBM Blue Gene/Q and Intel Xeon Phi (Knights Corner) because these cores have a higher instruction issue rate when running more than one thread per core.
