SIMD instructions lowering CPU frequency

Question

I read this article. It talks about AVX-512 instructions:

Intel’s latest processors have advanced instructions (AVX-512) that may cause the core, or maybe the rest of the CPU to run slower because of how much power they use.

I think Agner's blog also mentioned something similar (but I can't find the exact post).

I wonder which other instructions supported by Skylake have a similar effect, causing the CPU to lower its frequency? Is it all the v-prefixed instructions (such as vmovapd, vmulpd, vaddpd, vsubpd, vfmadd213pd)?

I am trying to compile a list of instructions to avoid when compiling my C++ application for Xeon Skylake.

Answer

The frequency impact depends on the width of the instruction and the instruction used.

There are three frequency levels, so-called licenses, from fastest to slowest: L0, L1 and L2. L0 is the "nominal" speed you'll see written on the box: when the box says "3.5 GHz turbo", it is referring to the single-core L0 turbo. L1 is a lower speed sometimes called AVX turbo or AVX2 turbo[5], originally associated with AVX and AVX2 instructions[1]. L2 is a lower speed than L1, sometimes called "AVX-512 turbo".

The exact speeds for each license also depend on the number of active cores. For up-to-date tables, you can usually consult WikiChip. For example, consider the WikiChip frequency table for the Xeon Gold 5120.

In that table, the Normal, AVX2 and AVX512 rows correspond to the L0, L1 and L2 licenses respectively. Note that the relative slowdown for the L1 and L2 licenses generally gets worse as the number of active cores increases: for 1 or 2 active cores the L1 and L2 speeds are 97% and 91% of L0, but for 13 or 14 cores they are 85% and 62% respectively. This varies by chip, but the general trend is usually the same.

With those preliminaries out of the way, let's get to what I think you are asking: which instructions cause which licenses to be activated?

Here's a table, showing the implied license for instructions based on their width and their categorization as light or heavy:

   Width    Light   Heavy  
 --------- ------- ------- 
  Scalar    L0      N/A
  128-bit   L0      L0     
  256-bit   L0      L1*    
  512-bit   L1      L2*

*soft transition (see below)

So we immediately see that all scalar (non-SIMD) instructions and all 128-bit wide instructions[2] always run at full speed in the L0 license.

256-bit instructions will run in L0 or L1, depending on whether they are light or heavy, and 512-bit instructions will run in L1 or L2 on the same basis.
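The table can also be written as a small lookup. The Python sketch below is only an illustration of the table above; the names `LICENSE_TABLE`, `implied_license` and `worst_license` are made up for this example and do not belong to any real tool.

```python
# Implied license by (width, weight), transcribed from the table above.
# All names here are illustrative, not part of any real tool or API.
LICENSE_TABLE = {
    ("scalar", "light"): "L0",
    (128, "light"): "L0",
    (128, "heavy"): "L0",
    (256, "light"): "L0",
    (256, "heavy"): "L1",  # marked L1* in the table
    (512, "light"): "L1",
    (512, "heavy"): "L2",  # marked L2* in the table
}


def implied_license(width, weight):
    """Return the license implied by a single instruction, per the table.

    Raises KeyError for ("scalar", "heavy"), which the table lists as N/A.
    """
    return LICENSE_TABLE[(width, weight)]


def worst_license(instructions):
    """Highest-numbered license implied by any (width, weight) pair.

    This is an upper bound: for the starred entries the CPU may stay in a
    lower license if such instructions are sparse.
    """
    return max((implied_license(w, k) for w, k in instructions), default="L0")
```

For example, `worst_license([(256, "light"), (512, "light")])` gives `"L1"`, because a single light 512-bit instruction is enough to imply L1.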

So what is this light and heavy thing?

Light vs Heavy

It's easiest to start by explaining heavy instructions.

Heavy instructions are all SIMD instructions that need to run on the FP/FMA unit. Basically that's the majority of the FP instructions (those usually ending in ps or pd, like addpd), as well as integer multiplication instructions, which largely start with vpmul or vpmad, since SIMD integer multiplication actually runs on the FP/FMA unit, and vplzcnt(q|d), which apparently also runs on the FMA unit.

Given that, light instructions are everything else. In particular, integer arithmetic other than multiplication, logical instructions, shuffles/blends (including FP) and SIMD load and store are light.
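As a rough illustration of these rules, here is a heuristic classifier in Python applied to the mnemonics from the question. It is a sketch, not an authoritative list: the function name `is_heavy` and the prefix lists are my own assumptions derived from the description above, and some instructions (conversions, for example) may be classified wrongly.

```python
# Heuristic light/heavy classifier based only on the rules of thumb above.
# The prefix lists are assumptions made for this sketch, not an official list.

# Integer multiplies and vplzcnt are described as heavy.
HEAVY_PREFIXES = ("vpmul", "vpmad", "vplzcnt")

# FP-suffixed instructions that are data movement, shuffles/blends or
# logical operations, which the text describes as light.
LIGHT_PREFIXES = (
    "vmov", "vshuf", "vperm", "vblend", "vunpck",
    "vand", "vor", "vxor", "vbroadcast", "vinsert", "vextract",
)


def is_heavy(mnemonic):
    """Guess whether a VEX/EVEX SIMD mnemonic is 'heavy'."""
    m = mnemonic.lower()
    if m.startswith(HEAVY_PREFIXES):
        return True
    if m.startswith(LIGHT_PREFIXES):
        return False
    # Remaining packed FP instructions (ending in ps/pd) are treated as heavy.
    return m.endswith(("ps", "pd"))


if __name__ == "__main__":
    for name in ["vmovapd", "vmulpd", "vaddpd", "vsubpd", "vfmadd213pd"]:
        print(name, "heavy" if is_heavy(name) else "light")
```

Under this heuristic only vmovapd from the question's list is light; the other four are FP arithmetic and count as heavy, which matters only at 256-bit and 512-bit widths.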

Transitions

The L1 and L2 entries in the Heavy column are marked with an asterisk, like L1*. That's because these instructions cause a soft transition when they occur. The other L1 entry (for 512-bit light instructions) causes a hard transition. Here we'll discuss the two transition types.

Hard Transition

A hard transition occurs immediately as soon as any instruction with the given license executes[4]. The CPU stops, takes some halt cycles and enters the new mode.

Soft Transition

Unlike hard transitions, a soft transition doesn't occur immediately as soon as any instruction is executed. Rather, the instructions initially execute with a reduced throughput (as slow as 1/4 their normal rate), without changing the frequency. If the CPU decides that "enough" heavy instructions are executing per unit time, and a specific threshold is reached, a transition to the higher-numbered license occurs.

That is, the CPU understands that if only a few heavy instructions arrive, or even if many arrive but they aren't dense when considering other non-heavy instructions, it may not be worth reducing the frequency.
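To make the density idea concrete, here is a toy model in Python. It is purely illustrative: the sliding window, the `window` and `threshold` values and the function name `soft_transition_point` are hypothetical, since the actual mechanism and thresholds used by the CPU are not given here.

```python
from collections import deque


def soft_transition_point(stream, window=8, threshold=0.5):
    """Return the index at which this toy model would change license, or None.

    stream:    iterable of booleans, True meaning a heavy instruction.
    window:    number of recent instructions considered (hypothetical value).
    threshold: fraction of heavy instructions in the window that triggers
               the transition (hypothetical value).
    """
    recent = deque(maxlen=window)
    for i, heavy in enumerate(stream):
        recent.append(bool(heavy))
        # Only decide once a full window of history is available.
        if len(recent) == window and sum(recent) / window >= threshold:
            return i
    return None


if __name__ == "__main__":
    sparse = [True] + [False] * 20   # one isolated heavy instruction
    diluted = [True, False, False, False] * 8  # many, but only 25% dense
    dense = [True] * 8               # back-to-back heavy instructions
    print(soft_transition_point(sparse))
    print(soft_transition_point(diluted))
    print(soft_transition_point(dense))
```

In this model neither the isolated heavy instruction nor the diluted stream ever triggers a transition, while the dense run does, which mirrors the behavior described above.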

Guidelines

Given the above, we can establish some reasonable guidelines. You never have to be scared of 128-bit instructions, since they never cause license-related[3] downclocking.

Furthermore, you never have to be worried about light 256-bit wide instructions either, since they also don't cause downclocking. If you aren't using a lot of vectorized FP math, you aren't likely to be using heavy instructions, so this would apply to you. Indeed, compilers already liberally insert 256-bit instructions when you use the appropriate -march option, especially for data movement and auto-vectorized loops.

Using heavy AVX/AVX2 instructions and light AVX-512 instructions is trickier, because you will run in the L1 license. If only a small part of your process (say 10%) can take advantage of these instructions, it probably isn't worth slowing down the rest of your application. The penalties associated with L1 are generally moderate - but check the details for your chip.
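A back-of-the-envelope calculation shows why. The Python sketch below assumes, pessimistically, that the whole program runs at the reduced frequency once the license is active, and that the vectorized part gets a 2x speedup; that speedup is a made-up figure for illustration. The frequency ratios 0.85 and 0.97 are the L1 figures quoted earlier for the Xeon Gold 5120. The function names are mine.

```python
def relative_runtime(vector_fraction, vector_speedup, freq_ratio):
    """Runtime after vectorizing, relative to the baseline at L0 (1.0 = same).

    vector_fraction: share of baseline runtime that can be vectorized.
    vector_speedup:  speedup of that share at equal frequency (assumed).
    freq_ratio:      license frequency divided by L0 frequency.
    Assumes the entire program runs at the reduced frequency.
    """
    return ((1 - vector_fraction) + vector_fraction / vector_speedup) / freq_ratio


def break_even_speedup(vector_fraction, freq_ratio):
    """Speedup the vectorized share needs just to match the baseline.

    Returns infinity when no speedup can compensate for the lower frequency.
    """
    margin = freq_ratio - (1 - vector_fraction)
    if margin <= 0:
        return float("inf")
    return vector_fraction / margin


if __name__ == "__main__":
    # 10% of the program vectorized with an assumed 2x speedup.
    print(relative_runtime(0.10, 2.0, 0.85))  # 13-14 active cores, L1
    print(relative_runtime(0.10, 2.0, 0.97))  # 1-2 active cores, L1
    print(break_even_speedup(0.10, 0.85))
```

With these assumptions the 10% case ends up roughly 12% slower at a 0.85 frequency ratio and about 2% faster at 0.97. At 0.85 no speedup of the 10% portion can break even, because the remaining 90% alone already takes longer than the original program.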

Using heavy AVX-512 instructions is even trickier, because the L2 license comes with serious frequency penalties on most chips. On the other hand, it is important to note that only FP and integer multiply instructions fall into the heavy category, so as a practical matter a lot of integer 512-bit wide use will only incur the L1 license.


[1] Although, as we'll see, this is a bit of a misnomer because AVX-512 instructions can set the speed to this license, and some AVX/AVX2 instructions don't.

[2] 128-bit wide means using xmm registers, regardless of what instruction set they were introduced in - mainstream AVX-512 contains 128-bit variants for most/all new instructions.

[3] Note the weasel clause "license-related" - you may certainly suffer other causes of downclocking, such as thermal, power or current limits, and it is possible that 128-bit instructions could trigger this, but I think it is fairly unlikely on a desktop or server system (low power, small form factor devices are another matter).

[4] Evidently, we are talking only about transitions to a higher-numbered license, e.g., from L0 to L1 when a hard-transition L1 instruction executes. If you are already in L1 or L2, nothing happens: there is no transition if you are already at the same level. You also don't transition to a lower-numbered level because of any specific instruction; that happens only after running for a certain time without any instructions of the higher-numbered level.

[5] Of the two names, AVX2 turbo is more common, which I never really understood because 256-bit instructions are as much associated with AVX as with AVX2, and most of the heavy instructions which actually trigger AVX turbo (L1 license) are FP instructions in AVX, not AVX2. The only exception is AVX2 integer multiplies.
