Add+Mul become slower with Intrinsics - where am I wrong?

Problem description

Given this array:

alignas(16) double c[voiceSize][blockSize];

This is the function I'm trying to optimize:

inline void Process(int voiceIndex, int blockSize) {    
    double *pC = c[voiceIndex];
    double value = start + step * delta;
    double deltaValue = rate * delta;

    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
        pC[sampleIndex] = value + deltaValue * sampleIndex;
    }
}

And this is my intrinsics (SSE2) attempt:

inline void Process(int voiceIndex, int blockSize) {    
    double *pC = c[voiceIndex];
    double value = start + step * delta;
    double deltaValue = rate * delta;

    __m128d value_add = _mm_set1_pd(value);
    __m128d deltaValue_mul = _mm_set1_pd(deltaValue);

    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
        __m128d result_mul = _mm_setr_pd(sampleIndex, sampleIndex + 1);
        result_mul = _mm_mul_pd(result_mul, deltaValue_mul);
        result_mul = _mm_add_pd(result_mul, value_add);

        _mm_store_pd(pC + sampleIndex, result_mul);
    }   
}

Which is slower than the "scalar" (even if auto-optimized) original code, unfortunately :)

Where's the bottleneck in your opinion? Where am I wrong?

I'm using MSVC, Release/x86, with the /O2 optimization flag (Favor fast code).

EDIT: doing this (suggested by @wim), it seems that performance becomes better than the C version:

inline void Process(int voiceIndex, int blockSize) {    
    double *pC = c[voiceIndex];
    double value = start + step * delta;
    double deltaValue = rate * delta;

    __m128d value_add = _mm_set1_pd(value);
    __m128d deltaValue_mul = _mm_set1_pd(deltaValue);

    __m128d sampleIndex_acc = _mm_set_pd(-1.0, -2.0);
    __m128d sampleIndex_add = _mm_set1_pd(2.0);

    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
        sampleIndex_acc = _mm_add_pd(sampleIndex_acc, sampleIndex_add);
        __m128d result_mul = _mm_mul_pd(sampleIndex_acc, deltaValue_mul);
        result_mul = _mm_add_pd(result_mul, value_add);

        _mm_store_pd(pC + sampleIndex, result_mul);
    }
}

Why? Is _mm_setr_pd expensive?

Recommended answer

Why? Is _mm_setr_pd expensive?

Somewhat; it takes at least a shuffle. More importantly in this case, computing each scalar operand is expensive, and as @spectras' answer shows, gcc at least fails to auto-vectorize that into paddd / cvtdq2pd. Instead it re-computes each operand from a scalar integer, doing the int->double conversion separately, then shuffles those together.
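
For reference, here is a minimal hand-written sketch of that paddd / cvtdq2pd pattern (not from the original answer, untested), reusing the value_add / deltaValue_mul setup from the question: keep the index in an integer vector, convert it once per iteration, and avoid rebuilding it with _mm_setr_pd:

__m128i idx     = _mm_setr_epi32(0, 1, 0, 0);   // low two lanes hold sampleIndex, sampleIndex + 1
__m128i idx_inc = _mm_set1_epi32(2);

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
    __m128d vi = _mm_cvtepi32_pd(idx);                   // cvtdq2pd: low two int32 lanes -> double
    __m128d result = _mm_add_pd(_mm_mul_pd(vi, deltaValue_mul), value_add);
    _mm_store_pd(pC + sampleIndex, result);
    idx = _mm_add_epi32(idx, idx_inc);                   // paddd: advance the integer index
}

Even so, the strength reduction described below is cheaper, since it drops the per-iteration multiply and conversion entirely.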

This is the function I'm trying to optimize:

You're simply filling an array with a linear function. You're re-multiplying every time inside the loop. That avoids a loop-carried dependency on anything except the integer loop counter, but you run into throughput bottlenecks from doing so much work inside the loop.

i.e. you're computing a[i] = c + i*scale separately for every step. But instead you can strength-reduce that to a[i+n] = a[i] + (n*scale). So you only have one addpd instruction per vector of results.
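
In scalar terms, that strength reduction is simply (a sketch using the question's variable names, not code from the original post):

double v = value;                       // a[0] = c
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
    pC[sampleIndex] = v;
    v += deltaValue;                    // a[i+1] = a[i] + scale: one add per element, no multiply
}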

This will introduce some rounding error that accumulates vs. redoing the computation from scratch, but double is probably overkill for what you're doing anyway.

It also comes at the cost of introducing a serial dependency on an FP add instead of integer. But you already have a loop-carried FP add dependency chain in your "optimized" version that uses sampleIndex_acc = _mm_add_pd(sampleIndex_acc, sampleIndex_add); inside the loop, using FP += 2.0 instead of re-converting from integer.

So you'll want to unroll with multiple vectors to hide that FP latency, and keep at least 3 or 4 FP additions in flight at once. (Haswell: 3 cycle latency, one per clock throughput. Skylake: 4 cycle latency, 2 per clock throughput.) See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about unrolling with multiple accumulators for a similar problem with loop-carried dependencies (a dot product).

void Process(int voiceIndex, int blockSize) {    
    double *pC = c[voiceIndex];
    double val0 = start + step * delta;
    double deltaValue = rate * delta;

    __m128d vdelta2 = _mm_set1_pd(2 * deltaValue);
    __m128d vdelta4 = _mm_add_pd(vdelta2, vdelta2);

    __m128d v0 = _mm_setr_pd(val0, val0 + deltaValue);
    __m128d v1 = _mm_add_pd(v0, vdelta2);
    __m128d v2 = _mm_add_pd(v0, vdelta4);
    __m128d v3 = _mm_add_pd(v1, vdelta4);

    __m128d vdelta8 = _mm_mul_pd(vdelta2, _mm_set1_pd(4.0));

    double *endp = pC + blockSize - 7;  // stop if there's only room for 7 or fewer doubles
      // or use -8 and have your cleanup handle lengths of 1..8
      // since the inner loop always calculates results for next iteration
    for (; pC < endp ; pC += 8) {
        _mm_store_pd(pC, v0);
        v0 = _mm_add_pd(v0, vdelta8);

        _mm_store_pd(pC+2, v1);
        v1 = _mm_add_pd(v1, vdelta8);

        _mm_store_pd(pC+4, v2);
        v2 = _mm_add_pd(v2, vdelta8);

        _mm_store_pd(pC+6, v3);
        v3 = _mm_add_pd(v3, vdelta8);
    }
    // if (blockSize % 8 != 0) ... store final vectors
}
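
One possible way to handle the leftover elements that the final comment refers to (a sketch, not part of the original answer, assuming blockSize is a multiple of 2; it would go where that comment sits, just before the function's closing brace). At loop exit v0 already holds the values for the next two slots, so keep storing one vector at a time:

    // hypothetical cleanup sketch: finish the remaining 0..6 doubles
    double *realend = c[voiceIndex] + blockSize;
    for (; pC < realend; pC += 2) {
        _mm_store_pd(pC, v0);
        v0 = _mm_add_pd(v0, vdelta2);   // advance by 2 elements per store
    }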

The choice of whether to add or multiply when building up vdelta4 / vdelta8 is not very significant; I tried to avoid too long a dependency chain before the first stores can happen. Since v0 through v3 need to be calculated as well, it seemed to make sense to create a vdelta4 instead of just making a chain of v2 = v1+vdelta2. Maybe it would have been better to create vdelta4 with a multiply from 4.0*delta, and double it to get vdelta8. This could be relevant for very small block size, especially if you cache-block your code by only generating small chunks of this array as needed, right before it will be read.
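
That alternative construction would look something like this (a sketch of the variant described above, not code from the original answer):

__m128d vdelta4 = _mm_set1_pd(4.0 * deltaValue);   // built with one multiply instead of vdelta2 + vdelta2
__m128d vdelta8 = _mm_add_pd(vdelta4, vdelta4);    // then doubled to get the per-iteration stride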

Anyway, this compiles to a very efficient inner loop with gcc and MSVC (on the Godbolt compiler explorer).

;; MSVC -O2
$LL4@Process:                    ; do {
    movups  XMMWORD PTR [rax], xmm5
    movups  XMMWORD PTR [rax+16], xmm0
    movups  XMMWORD PTR [rax+32], xmm1
    movups  XMMWORD PTR [rax+48], xmm2
    add     rax, 64                             ; 00000040H
    addpd   xmm5, xmm3              ; v0 += vdelta8
    addpd   xmm0, xmm3              ; v1 += vdelta8
    addpd   xmm1, xmm3              ; v2 += vdelta8
    addpd   xmm2, xmm3              ; v3 += vdelta8
    cmp     rax, rcx
    jb      SHORT $LL4@Process   ; }while(pC < endp)

This has 4 separate dependency chains, through xmm0, 1, 2, and 5. So there's enough instruction-level parallelism to keep 4 addpd instructions in flight. This is more than enough for Haswell, but half of what Skylake can sustain.

Still, with a store throughput of 1 vector per clock, more than 1 addpd per clock isn't useful. In theory this can run at about 16 bytes per clock cycle, and saturate store throughput. i.e. 1 vector / 2 doubles per clock.

AVX with wider vectors (4 doubles) could still go at 1 vector per clock on Haswell and later, i.e. 32 bytes per clock. (Assuming the output array is hot in L1d cache or possibly even L2.)
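
A rough AVX sketch of the same unrolled inner loop with 256-bit vectors (not from the original answer; it reuses val0 / deltaValue / pC / blockSize from the function above and assumes the array is 32-byte aligned, otherwise use _mm256_storeu_pd):

__m256d vdelta16 = _mm256_set1_pd(16.0 * deltaValue);   // 4 vectors * 4 doubles advanced per iteration
__m256d w0 = _mm256_setr_pd(val0, val0 + deltaValue, val0 + 2 * deltaValue, val0 + 3 * deltaValue);
__m256d w1 = _mm256_add_pd(w0, _mm256_set1_pd(4.0 * deltaValue));
__m256d w2 = _mm256_add_pd(w0, _mm256_set1_pd(8.0 * deltaValue));
__m256d w3 = _mm256_add_pd(w1, _mm256_set1_pd(8.0 * deltaValue));

for (double *endp = pC + blockSize - 15; pC < endp; pC += 16) {
    _mm256_store_pd(pC,      w0);  w0 = _mm256_add_pd(w0, vdelta16);
    _mm256_store_pd(pC + 4,  w1);  w1 = _mm256_add_pd(w1, vdelta16);
    _mm256_store_pd(pC + 8,  w2);  w2 = _mm256_add_pd(w2, vdelta16);
    _mm256_store_pd(pC + 12, w3);  w3 = _mm256_add_pd(w3, vdelta16);
}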

Generate it on the fly when it's needed, if the code consuming it only reads it a few times, and is also manually vectorized.
