_mm256_fmadd_ps is slower than _mm256_mul_ps + _mm256_add_ps?


Question

I have an image processing algorithm to calculate a*b+c*d with AVX. The pseudo code is as follows:

float *a=new float[N];
float *b=new float[N];
float *c=new float[N];
float *d=new float[N];

//assign values to a, b, c and d
__m256 sum=_mm256_setzero_ps(); // accumulator must start at zero
double start=cv::getTickCount();
for (int i = 0; i < N; i += 8) // assume that N is a multiple of 8
{
    __m256 am=_mm256_loadu_ps(a+i);
    __m256 bm=_mm256_loadu_ps(b+i);
    __m256 cm=_mm256_loadu_ps(c+i);
    __m256 dm=_mm256_loadu_ps(d+i);

    __m256 abm=_mm256_mul_ps(am, bm);
    __m256 cdm=_mm256_mul_ps(cm, dm);
    __m256 abcdm=_mm256_add_ps(abm, cdm);
    sum=_mm256_add_ps(sum, abcdm);
}
double time1=(cv::getTickCount()-start)/cv::getTickFrequency();

I changed the _mm256_mul_ps and _mm256_add_ps above to _mm256_fmadd_ps as follows:

float *a=new float[N];
float *b=new float[N];
float *c=new float[N];
float *d=new float[N];

//assign values to a, b, c and d
__m256 sum=_mm256_setzero_ps(); // accumulator must start at zero
double start=cv::getTickCount();
for (int i = 0; i < N; i += 8) // assume that N is a multiple of 8
{
    __m256 am=_mm256_loadu_ps(a+i);
    __m256 bm=_mm256_loadu_ps(b+i);
    __m256 cm=_mm256_loadu_ps(c+i);
    __m256 dm=_mm256_loadu_ps(d+i);

    sum=_mm256_fmadd_ps(am, bm, sum);
    sum=_mm256_fmadd_ps(cm, dm, sum);
}
double time2=(cv::getTickCount()-start)/cv::getTickFrequency();

But the second version is slower than the first! The mul/add version's time1 is 50 ms, while the FMA version's time2 is 90 ms. Is _mm256_fmadd_ps slower than _mm256_mul_ps + _mm256_add_ps?

I'm using Ubuntu 16.04 and GCC 7.5.0, with compiler flags: -fopenmp -march=native -O3

Answer

Your reduction loops both bottleneck on latency, not throughput, because you're only using one FP vector accumulator. The FMA one is slower because you made the critical path longer (a chain of 2 instructions per loop iteration instead of just 1).

In the add case, the loop-carried dependency chain for sum is only sum=_mm256_add_ps(sum, abcdm);. The other instructions are independent for each iteration, and can have that abcdm input ready to go before the previous vaddps has this iteration's sum ready.

In the fma case, the loop-carried dep chain goes through two _mm256_fmadd_ps operations, both into sum, so yes, you'd expect it to be about twice as slow.

Unroll with more accumulators to hide FP latency (like normal for a dot product). See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for much more detail about that and how OoO exec works.

Also see Improving performance of floating-point dot-product of an array with SIMD for a much simpler beginner-friendly example of 2 accumulators.

(Adding up those separate __m256 sum0, sum1, sum2, etc vars should be done after the loop. You can also use __m256 sum[4] to save typing. You can even use an inner loop over that array; most compilers will fully unroll small fixed-count loops, so you get the desired unrolled asm with each __m256 in a separate YMM register.)
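
For example, a minimal sketch of that array-of-accumulators idea (not from the original answer; it assumes the question's a, b, c and d arrays, <immintrin.h> for the intrinsics, and that N is a multiple of 32):

__m256 sum[4];
for (int k = 0; k < 4; ++k)
    sum[k] = _mm256_setzero_ps();

for (int i = 0; i < N; i += 32)        // 4 vectors of 8 floats per outer iteration
{
    for (int k = 0; k < 4; ++k)        // small fixed count: compilers fully unroll this
    {
        __m256 am = _mm256_loadu_ps(a + i + 8 * k);
        __m256 bm = _mm256_loadu_ps(b + i + 8 * k);
        __m256 cm = _mm256_loadu_ps(c + i + 8 * k);
        __m256 dm = _mm256_loadu_ps(d + i + 8 * k);

        sum[k] = _mm256_fmadd_ps(am, bm, sum[k]);   // each sum[k] is its own dep chain
        sum[k] = _mm256_fmadd_ps(cm, dm, sum[k]);
    }
}

// Combine the independent accumulators only after the loop.
__m256 total = _mm256_add_ps(_mm256_add_ps(sum[0], sum[1]),
                             _mm256_add_ps(sum[2], sum[3]));

With four independent chains, out-of-order exec can overlap the FMA latency, so the loop can run at the load-throughput limit instead of stalling on one serial chain.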

Or let clang auto-vectorize this; it will normally do that unrolling with multiple accumulators for you.
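
For reference, a plain scalar version (a hypothetical helper, not from the original answer) that clang at -O3 -march=native -ffast-math will typically auto-vectorize and unroll with several accumulators on its own:

// -ffast-math allows the reassociation needed to split the
// reduction across multiple vector accumulators.
float sum_products(const float *a, const float *b,
                   const float *c, const float *d, int n)
{
    float total = 0.0f;
    for (int i = 0; i < n; ++i)
        total += a[i] * b[i] + c[i] * d[i];
    return total;
}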

Or if you for some reason didn't want to unroll, you could use FMA while keeping the loop-carried latency low with sum += fma(a, b, c*d); (one mul, one FMA, one add). Of course assuming your compiler didn't "contract" your mul and add into FMA for you if you compiled with -ffast-math; GCC will do that aggressively across statements by default, clang won't.
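
In intrinsics, that variant might look like the following sketch (same assumptions as the question's loop, with N a multiple of 8):

__m256 sum = _mm256_setzero_ps();
for (int i = 0; i < N; i += 8)
{
    __m256 am = _mm256_loadu_ps(a + i);
    __m256 bm = _mm256_loadu_ps(b + i);
    __m256 cm = _mm256_loadu_ps(c + i);
    __m256 dm = _mm256_loadu_ps(d + i);

    __m256 cdm = _mm256_mul_ps(cm, dm);            // independent of sum
    __m256 abcdm = _mm256_fmadd_ps(am, bm, cdm);   // a*b + c*d, still off the sum chain
    sum = _mm256_add_ps(sum, abcdm);               // only this add is loop-carried
}

This keeps one FMA per iteration while the loop-carried dependency is still just a single vaddps.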

Once you do this, your throughput will bottleneck on 2 loads per clock (best case even with aligned arrays for no cache-line splits, which new won't give you), so using FMA barely helps except to reduce the front-end bottleneck. (Compared to a multiple accumulator mul/add version that needs to run 1 FP op per load to keep up; using multiple accumulators will let you go faster than either original loop. Like one iteration (4 loads) per 2 cycles, instead of 1 per 3 cycles with the vaddps latency bottleneck).

On Skylake and later, FMA/add/mul all have the same latency: 4 cycles. On Haswell/Broadwell, vaddps latency is 3 cycles (one dedicated FP add unit) while FMA latency is 5.

Zen2 has 3 cycle vaddps, 5 cycle vfma....ps (https://uops.info/). (2/clock throughput for both, and on different execution ports, so you could in theory run 2 FMAs and 2 vaddps per clock on Zen2.)

With your longer-latency FMA loop being less than twice as slow, I'm guessing you might be on a Skylake-derived CPU. Perhaps the mul/add version was bottlenecking a bit on the front-end or resource conflicts or something and not quite achieving the expected 1 iteration per 3 clocks latency-limited speed.

In general, see https://uops.info/ for latency and uops / port breakdowns. (also https://agner.org/optimize/).
