SIMD instructions lowering CPU frequency


Question


I read this article. It talks about the effect of AVX-512 instructions:

Intel’s latest processors have advanced instructions (AVX-512) that may cause the core, or maybe the rest of the CPU to run slower because of how much power they use.

I think Agner's blog also mentioned something similar (but I can't find the exact post).

I wonder which other instructions supported by Skylake have a similar effect, where the CPU lowers its frequency to limit power so that it can maximize throughput later. Is it all the v-prefixed instructions (such as vmovapd, vmulpd, vaddpd, vsubpd, vfmadd213pd)?

I am trying to compile a list of instructions to avoid when compiling my C++ application for Xeon Skylake.

Answer

The frequency impact depends on the width of the instruction and the instruction used.

There are three frequency levels, so-called licenses, from fastest to slowest: L0, L1 and L2. L0 is the "nominal" speed you'll see written on the box: when the chip says "3.5 GHz turbo", they are referring to the single-core L0 turbo. L1 is a lower speed sometimes called AVX turbo or AVX2 turbo[5], originally associated with AVX and AVX2 instructions[1]. L2 is a lower speed than L1, sometimes called "AVX-512 turbo".

The exact speeds for each license also depend on the number of active cores. For up-to-date tables, you can usually consult WikiChip. For example, the table for the Xeon Gold 5120 is here:

The Normal, AVX2 and AVX512 rows correspond to the L0, L1 and L2 licenses respectively. Note that the relative slowdown for the L1 and L2 licenses generally gets worse as the number of active cores increases: for 1 or 2 active cores the L1 and L2 speeds are 97% and 91% of L0, but for 13 or 14 cores they are 85% and 62% respectively. This varies by chip, but the general trend is usually the same.

Those preliminaries out of the way, let's get to what I think you are asking: which instructions cause which licenses to be activated?

Here's a table, showing the implied license for instructions based on their width and their categorization as light or heavy:

   Width    Light   Heavy  
 --------- ------- ------- 
  Scalar    L0      N/A
  128-bit   L0      L0     
  256-bit   L0      L1*    
  512-bit   L1      L2*

*soft transition (see below)

So we immediately see that all scalar (non-SIMD) instructions and all 128-bit wide instructions[2] always run at full speed in the L0 license.

256-bit instructions will run in L0 or L1, depending on whether they are light or heavy, and 512-bit instructions will run in L1 or L2 on the same basis.
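The table can also be written down as a small lookup function. This is only a sketch that transcribes the table above into C++; the enum and function names (`License`, `Width`, `Weight`, `implied_license`, `is_soft_transition`) are made up for illustration and do not correspond to any Intel or compiler API.

```cpp
// Frequency licenses as described above: L0 is the fastest, L2 the slowest.
enum class License { L0, L1, L2 };

// Vector width of an instruction; Scalar covers all non-SIMD instructions.
enum class Width { Scalar, Bits128, Bits256, Bits512 };

// Light/heavy categorization used by the table.
enum class Weight { Light, Heavy };

// Direct transcription of the width/weight table.
// Scalar instructions have no heavy variant (N/A in the table), so the
// weight argument is ignored for them.
constexpr License implied_license(Width width, Weight weight) {
    switch (width) {
        case Width::Scalar:
        case Width::Bits128:
            return License::L0;
        case Width::Bits256:
            return weight == Weight::Heavy ? License::L1 : License::L0;
        case Width::Bits512:
            return weight == Weight::Heavy ? License::L2 : License::L1;
    }
    return License::L0;  // not reached for valid enum values
}

// Only the heavy 256-bit and heavy 512-bit entries carry the asterisk,
// i.e. they request their license through a soft transition.
constexpr bool is_soft_transition(Width width, Weight weight) {
    return weight == Weight::Heavy &&
           (width == Width::Bits256 || width == Width::Bits512);
}
```

For example, `implied_license(Width::Bits512, Weight::Light)` yields `License::L1`, matching the 512-bit row of the table.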

So what is this light and heavy thing?

Light vs Heavy

It's easiest to start by explaining heavy instructions.

Heavy instructions are all SIMD instructions that need to run on the FP/FMA unit. Basically that's the majority of the FP instructions (those usually ending in ps or pd, like addpd), as well as the integer multiplication instructions, which largely start with vpmul or vpmad, since SIMD integer multiplication actually runs on the FMA unit. vplzcnt(q|d) is also heavy, since it apparently runs on the FMA unit too.

Given that, light instructions are everything else. In particular, integer arithmetic other than multiplication, logical instructions, shuffles/blends (including FP) and SIMD load and store are light.
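As a rough illustration, the naming rules of thumb above can be turned into a mnemonic classifier. This is a heuristic sketch, not an authoritative list: the function name `looks_heavy` and the prefix lists are assumptions chosen to match the description (data movement, shuffles/blends and logical operations are treated as light even when they end in ps or pd), and it will misjudge any instruction that does not follow these naming patterns.

```cpp
#include <initializer_list>
#include <string>

// True if `s` begins with `prefix`.
static bool starts_with(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// True if `s` ends with `suffix`.
static bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Heuristic guess (an assumption, not an official classification) at
// whether a lowercase AVX mnemonic falls into the "heavy" category.
bool looks_heavy(const std::string& mnemonic) {
    // Loads/stores, shuffles/blends and logical ops are light, including
    // their FP forms, so rule them out before looking at the suffix.
    for (const char* p : {"vmov", "vshuf", "vblend", "vperm", "vunpck",
                          "vand", "vor", "vxor"}) {
        if (starts_with(mnemonic, p)) return false;
    }
    // SIMD integer multiplies and vplzcnt(q|d) are described as heavy.
    for (const char* p : {"vpmul", "vpmad", "vplzcnt"}) {
        if (starts_with(mnemonic, p)) return true;
    }
    // Remaining packed FP instructions usually end in ps or pd.
    return ends_with(mnemonic, "ps") || ends_with(mnemonic, "pd");
}
```

Applied to the instructions from the question, this treats vmovapd as light and vmulpd, vaddpd, vsubpd and vfmadd213pd as heavy. Whether a heavy instruction actually costs frequency still depends on its width: at 128 bits it stays in L0.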

Transitions

The L1 and L2 entries in the Heavy column are marked with an asterisk, like L1*. That's because these instructions cause a soft transition when they occur. The other L1 entry (for 512-bit light instructions) causes a hard transition. Here we'll discuss the two transition types.

Hard Transition

A hard transition occurs immediately as soon as any instruction with the given license executes[4]. The CPU stops, takes some halt cycles and enters the new mode.

Soft Transition

Unlike hard transitions, a soft transition doesn't occur immediately as soon as any instruction is executed. Rather, the instructions initially execute with a reduced throughput (as slow as 1/4 their normal rate), without changing the frequency. If the CPU decides that "enough" heavy instructions are executing per unit time, and a specific threshold is reached, a transition to the higher-numbered license occurs.

That is, the CPU understands that if only a few heavy instructions arrive, or even if many arrive but they aren't dense when considering other non-heavy instructions, it may not be worth reducing the frequency.

Guidelines

Given the above, we can establish some reasonable guidelines. You never have to be scared of 128-bit instructions, since they never cause license-related[3] downclocking.

Furthermore, you never have to be worried about light 256-bit wide instructions either, since they also don't cause downclocking. If you aren't using a lot of vectorized FP math, you aren't likely to be using heavy instructions, so this would apply to you. Indeed, compilers already liberally insert 256-bit instructions when you use the appropriate -march option, especially for data movement and auto-vectorized loops.

Using heavy AVX/AVX2 instructions and light AVX-512 instructions is trickier, because you will run in the L1 license. If only a small part of your process (say 10%) can take advantage of them, it probably isn't worth slowing down the rest of your application. The penalties associated with L1 are generally moderate - but check the details for your chip.
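To put rough numbers on that trade-off, here is a back-of-the-envelope estimate in C++. It is a sketch under two simplifying assumptions that are not from the text above: the whole program runs at the reduced license frequency, and runtime scales inversely with frequency. The function name `relative_runtime` and its parameters are hypothetical.

```cpp
#include <cassert>

// Runtime after vectorizing, relative to the unvectorized program running
// at full L0 speed (1.0 = unchanged, below 1.0 = faster overall).
//
//   vector_fraction : share of the original runtime that gets vectorized
//   vector_speedup  : speedup of that share at equal clock speed
//   freq_ratio      : reduced-license frequency divided by L0 frequency
//
// Assumptions (simplifications): the entire program runs at the reduced
// frequency, and runtime is inversely proportional to frequency.
double relative_runtime(double vector_fraction, double vector_speedup,
                        double freq_ratio) {
    assert(vector_fraction >= 0.0 && vector_fraction <= 1.0);
    assert(vector_speedup > 0.0 && freq_ratio > 0.0);
    // Work remaining, in units of the original runtime at L0 speed.
    const double work = (1.0 - vector_fraction) + vector_fraction / vector_speedup;
    // The lower clock stretches all of that work.
    return work / freq_ratio;
}
```

With 10% of the runtime vectorized and an assumed 2x speedup for that part, `relative_runtime(0.10, 2.0, 0.85)` is about 1.12, so the program gets roughly 12% slower at the 85% L1 ratio quoted above for 13 or 14 active cores. At the 97% ratio for 1 or 2 cores the same call gives about 0.98, a gain of only around 2%.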

Using heavy AVX-512 instructions is even trickier, because the L2 license comes with serious frequency penalties on most chips. On the other hand, it is important to note that only FP and integer multiply instructions fall into the heavy category, so as a practical matter a lot of integer 512-bit wide use will only incur the L1 license.


[1] Although, as we'll see, this is a bit of a misnomer because AVX-512 instructions can set the speed to this license, and some AVX/2 instructions don't.

[2] 128-bit wide means using xmm registers, regardless of what instruction set they were introduced in - mainstream AVX-512 contains 128-bit variants for most/all new instructions.

[3] Note the weasel clause "license-related" - you may certainly suffer other causes of downclocking, such as thermal, power or current limits, and it is possible that 128-bit instructions could trigger this, but I think it is fairly unlikely on a desktop or server system (low power, small form factor devices are another matter).

[4] Evidently, we are talking only about transitions to a higher-numbered license, e.g., from L0 to L1 when a hard-transition L1 instruction executes. If you are already in L1 or L2, nothing happens: there is no transition when you are already at the required level. Nor does any specific instruction move you to a lower-numbered level; that happens only after running for a certain time without any instructions of the higher-numbered level.

[5] Of the two, AVX2 turbo is the more common name, which I never really understood, because 256-bit instructions are associated with AVX at least as much as with AVX2, and most of the heavy instructions that actually trigger AVX turbo (the L1 license) are FP instructions from AVX, not AVX2. The only exception is the AVX2 integer multiplies.
