Why do processors with only AVX out-perform AVX2 processors for many SIMD algorithms?


Question

I've been investigating the benefits of SIMD algorithms in C# and C++, and found that in many cases using 128-bit registers on an AVX processor offers a better improvement than using 256-bit registers on a processor with AVX2, but I don't understand why.

By improvement I mean the speed-up of a SIMD algorithm relative to a non-SIMD algorithm on the same machine.

Recommended answer

On an AVX processor, the CPU powers down the upper half of the 256-bit registers and floating-point units when it is not executing AVX instructions (VEX-encoded opcodes). When code does use AVX instructions, the CPU has to power up the FP units. This takes about 70 microseconds, during which time each AVX instruction is actually executed as two 128-bit micro-ops.

When AVX instructions haven't been used for about 700 microseconds, the CPU powers down the upper half of the circuitry again.

Now it does this because the upper half of the circuitry consumes power (doh!), and so generates heat (double doh!). This means that the CPU runs hotter when AVX instructions are used. So given that CPUs can "turbo boost" when they have thermal headroom, using AVX instructions reduces this chance, and in fact, the CPU actually reduces the "base clock speed". So if you have, for example, a CPU officially clocked at 2.3GHz that can turbo boost to 2.7, when you start using AVX instructions, the chip is clocked down to 2.1 and boosted to only 2.3, and in extreme cases the base clock may be reduced to 1.9 (see pages 2-4 of this).
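To put those clock figures in perspective, here is a small C++ sketch that turns a clock drop into a fractional slowdown. The function name `clock_slowdown` is made up for illustration, and the only inputs are the example frequencies quoted above (2.3, 2.7, 2.1 and 1.9 GHz), not measurements from any particular chip.

```cpp
#include <cassert>

// Hypothetical helper: fraction by which execution slows down when the
// clock drops from before_ghz to after_ghz, assuming run time scales
// inversely with clock frequency (a simplification).
double clock_slowdown(double before_ghz, double after_ghz) {
    assert(before_ghz > 0.0 && after_ghz > 0.0);
    return 1.0 - after_ghz / before_ghz;
}
```

With the numbers from the example, `clock_slowdown(2.3, 2.1)` is roughly 0.09 (base clock), `clock_slowdown(2.7, 2.3)` is roughly 0.15 (turbo), and `clock_slowdown(2.3, 1.9)` is roughly 0.17 (the extreme case).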

At this stage, your CPU is executing ALL instructions about 10-15%, maybe even 20% SLOWER than when not using AVX instructions. If you're doing loads of SIMD operations, the 256 bit wide instructions make this worthwhile. But if you're doing a few AVX instructions, then "normal" code, then a bit of AVX again, then this clock speed penalty will cost more than all the gains you can make from AVX alone.
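The trade-off can be sketched with a simple Amdahl-style model. This is only a back-of-the-envelope illustration: the names `relative_time` and `wide_simd_wins` are hypothetical, the speed-up factors are assumed inputs rather than measured values, and the model assumes the clock penalty applies to the whole run and that 128-bit code pays no penalty at all.

```cpp
#include <cassert>

// Run time relative to the scalar version (scalar = 1.0).
// simd_fraction: share of the scalar run time that can be vectorised (assumed input).
// simd_speedup:  speed-up of that vectorised share (assumed input).
// clock_penalty: fractional clock reduction, applied to ALL instructions.
double relative_time(double simd_fraction, double simd_speedup, double clock_penalty) {
    assert(simd_fraction >= 0.0 && simd_fraction <= 1.0);
    assert(simd_speedup > 0.0);
    assert(clock_penalty >= 0.0 && clock_penalty < 1.0);
    // Work remaining after vectorisation, measured at the full clock speed.
    double work = (1.0 - simd_fraction) + simd_fraction / simd_speedup;
    // A slower clock stretches everything, scalar code included.
    return work / (1.0 - clock_penalty);
}

// True when 256-bit code with a clock penalty still beats
// 128-bit code running at the full clock speed.
bool wide_simd_wins(double simd_fraction, double speedup_128,
                    double speedup_256, double clock_penalty) {
    return relative_time(simd_fraction, speedup_256, clock_penalty)
         < relative_time(simd_fraction, speedup_128, 0.0);
}
```

For instance, with assumed speed-ups of 4x (128-bit) and 8x (256-bit) and a 15% clock penalty, code that is only 20% vectorisable takes about 0.85 of the scalar time at 128 bits but about 0.97 at 256 bits, so the narrower version wins. At 90% vectorisable the figures are about 0.325 and 0.25, and the wider version wins.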

This can be why 128 bit wide SIMD can run faster than 256 bit wide unless you've got lengthy intensive bursts of SIMD-dominated operations. There is a price to using the rest of the silicon... (or perhaps more accurately, a reward for not using it that we sometimes forget we've been getting).
