Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell


Problem description

I am computing eight dot products at once with AVX. In my current code I do something like this (before unrolling):

Ivy-Bridge / Sandy-Bridge

__m256 areg0 = _mm256_set1_ps(a[m]);
for(int i=0; i<n; i++) {        
    __m256 breg0 = _mm256_load_ps(&b[8*i]);
    tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0); 
}

Haswell

__m256 areg0 = _mm256_set1_ps(a[m]);
for(int i=0; i<n; i++) {      
    __m256 breg0 = _mm256_load_ps(&b[8*i]);
    tmp0 = _mm256_fmadd_ps(areg0, breg0, tmp0);
}

How many times do I need to unroll the loop to ensure maximum throughput?

For Haswell using FMA3 I think the answer is here: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I need to unroll the loop 10 times.

For Ivy Bridge I think it's 8. Here is my logic. The AVX addition has a latency of 3 and the multiplication a latency of 5. Ivy Bridge can do one AVX multiplication and one AVX addition at the same time using different ports. Using the notation m for multiplication, a for addition, and x for no operation as well as a number to indicate the partial sum (e.g. m5 means multiplication with the 5th partial sum) I can write:

port0:  m1  m2  m3  m4  m5  m6  m7  m8  m1  m2  m3  m4  m5  ... 
port1:   x   x   x   x   x  a1  a2  a3  a4  a5  a6  a7  a8  ...

So by using 8 partial sums after nine clock cycles (four from the load and five from the multiplication) I can submit one AVX load, one AVX addition and one AVX multiplication every clock cycle.
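
To make this concrete, here is a minimal sketch (mine, not the original code) of the eight-way unroll of the simple loop above: eight independent partial sums, combined once outside the loop, assuming n is a multiple of 8. The row_m64x64 function further down has the same dependency structure, since each of its eight accumulators is updated only once per iteration.

__m256 areg0 = _mm256_set1_ps(a[m]);
__m256 sum0 = _mm256_setzero_ps(), sum1 = _mm256_setzero_ps();
__m256 sum2 = _mm256_setzero_ps(), sum3 = _mm256_setzero_ps();
__m256 sum4 = _mm256_setzero_ps(), sum5 = _mm256_setzero_ps();
__m256 sum6 = _mm256_setzero_ps(), sum7 = _mm256_setzero_ps();
for(int i=0; i<n; i+=8) {
    // Each partial sum has its own 3-cycle add chain; with eight chains the
    // adds on any one of them are eight issue slots apart, hiding the latency.
    sum0 = _mm256_add_ps(_mm256_mul_ps(areg0, _mm256_load_ps(&b[8*(i+0)])), sum0);
    sum1 = _mm256_add_ps(_mm256_mul_ps(areg0, _mm256_load_ps(&b[8*(i+1)])), sum1);
    sum2 = _mm256_add_ps(_mm256_mul_ps(areg0, _mm256_load_ps(&b[8*(i+2)])), sum2);
    sum3 = _mm256_add_ps(_mm256_mul_ps(areg0, _mm256_load_ps(&b[8*(i+3)])), sum3);
    sum4 = _mm256_add_ps(_mm256_mul_ps(areg0, _mm256_load_ps(&b[8*(i+4)])), sum4);
    sum5 = _mm256_add_ps(_mm256_mul_ps(areg0, _mm256_load_ps(&b[8*(i+5)])), sum5);
    sum6 = _mm256_add_ps(_mm256_mul_ps(areg0, _mm256_load_ps(&b[8*(i+6)])), sum6);
    sum7 = _mm256_add_ps(_mm256_mul_ps(areg0, _mm256_load_ps(&b[8*(i+7)])), sum7);
}
// Combine the partial sums once, outside the loop.
__m256 tmp0 = _mm256_add_ps(_mm256_add_ps(_mm256_add_ps(sum0, sum1), _mm256_add_ps(sum2, sum3)),
                            _mm256_add_ps(_mm256_add_ps(sum4, sum5), _mm256_add_ps(sum6, sum7)));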

I guess this means it's not possible to achieve maximum throughput for this task in 32-bit mode on Ivy Bridge and Haswell, since 32-bit mode has only eight AVX registers?

In regards to the bounty: my main questions still hold. I want to get the maximum throughput of either the Ivy Bridge or Haswell functions above; n can be any value greater than or equal to 64. I think this can only be done using unrolling (eight times for Ivy Bridge and 10 times for Haswell). If you think this can be done with another method, then let's see it. In some sense this is a variation of how to achieve 4 FLOPs per cycle, but instead of only multiplication and addition I'm looking for one 256-bit load (or two 128-bit loads), one AVX multiplication, and one AVX addition every clock cycle on Ivy Bridge, or two 256-bit loads and two FMA3 instructions per clock cycle on Haswell.

I would also like to know how many registers are necessary. For Ivy Bridge I think it's 10. One for the broadcast, one for the load (only one due to register renaming), and eight for the eight partial sums. So I don't think this can be done in 32-bit mode (and indeed when I run in 32-bit mode the performance drops significantly).

I should point out that the compiler can give misleading results; see Difference in performance between MSVC and GCC for highly optimized matrix multiplication code.

The current function I'm using for Ivy Bridge is below. This basically multiplies one row of a 64x64 matrix a with all of a 64x64 matrix b (I run this function 64 times, once per row of a, to get the full matrix product in c).

#include <immintrin.h>
extern "C" void row_m64x64(const float *a, const float *b, float *c) {      
    const int vec_size = 8;
    const int n = 64;
    __m256 tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
    // Load the existing row of c into eight accumulators.
    tmp0 = _mm256_loadu_ps(&c[0*vec_size]);
    tmp1 = _mm256_loadu_ps(&c[1*vec_size]);
    tmp2 = _mm256_loadu_ps(&c[2*vec_size]);
    tmp3 = _mm256_loadu_ps(&c[3*vec_size]);
    tmp4 = _mm256_loadu_ps(&c[4*vec_size]);
    tmp5 = _mm256_loadu_ps(&c[5*vec_size]);
    tmp6 = _mm256_loadu_ps(&c[6*vec_size]);
    tmp7 = _mm256_loadu_ps(&c[7*vec_size]);

    for(int i=0; i<n; i++) {
        // Broadcast a[i], then accumulate a[i] * (row i of b) into the eight sums.
        __m256 areg0 = _mm256_set1_ps(a[i]);

        __m256 breg0 = _mm256_loadu_ps(&b[vec_size*(8*i + 0)]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0);    
        __m256 breg1 = _mm256_loadu_ps(&b[vec_size*(8*i + 1)]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(areg0,breg1), tmp1);
        __m256 breg2 = _mm256_loadu_ps(&b[vec_size*(8*i + 2)]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(areg0,breg2), tmp2);    
        __m256 breg3 = _mm256_loadu_ps(&b[vec_size*(8*i + 3)]);
        tmp3 = _mm256_add_ps(_mm256_mul_ps(areg0,breg3), tmp3);   
        __m256 breg4 = _mm256_loadu_ps(&b[vec_size*(8*i + 4)]);
        tmp4 = _mm256_add_ps(_mm256_mul_ps(areg0,breg4), tmp4);    
        __m256 breg5 = _mm256_loadu_ps(&b[vec_size*(8*i + 5)]);
        tmp5 = _mm256_add_ps(_mm256_mul_ps(areg0,breg5), tmp5);    
        __m256 breg6 = _mm256_loadu_ps(&b[vec_size*(8*i + 6)]);
        tmp6 = _mm256_add_ps(_mm256_mul_ps(areg0,breg6), tmp6);    
        __m256 breg7 = _mm256_loadu_ps(&b[vec_size*(8*i + 7)]);
        tmp7 = _mm256_add_ps(_mm256_mul_ps(areg0,breg7), tmp7);    
    }
    // Write the updated row of c back.
    _mm256_storeu_ps(&c[0*vec_size], tmp0);
    _mm256_storeu_ps(&c[1*vec_size], tmp1);
    _mm256_storeu_ps(&c[2*vec_size], tmp2);
    _mm256_storeu_ps(&c[3*vec_size], tmp3);
    _mm256_storeu_ps(&c[4*vec_size], tmp4);
    _mm256_storeu_ps(&c[5*vec_size], tmp5);
    _mm256_storeu_ps(&c[6*vec_size], tmp6);
    _mm256_storeu_ps(&c[7*vec_size], tmp7);
}
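
For context, a hypothetical driver (the name mul64x64 and this wrapper are my sketch, not from the original post) showing how the function is called once per row of a for row-major 64x64 matrices:

void mul64x64(const float *a, const float *b, float *c) {
    // c must be initialized (e.g. zeroed) beforehand, since row_m64x64
    // accumulates into the existing contents of each row of c.
    for(int m=0; m<64; m++) {
        row_m64x64(&a[64*m], b, &c[64*m]);
    }
}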


Recommended answer

For Ivy Bridge/Sandy Bridge you need to unroll by 3:

  • Only FP Add has a dependency on the previous iteration of the loop
  • FP Add can issue every cycle
  • FP Add takes three cycles to complete
  • Thus unrolling by 3/1 = 3 completely hides the latency (a minimal sketch follows this list)
  • FP Mul and FP Load do not have a dependency on the previous iteration, and you can rely on the OoO core to issue them in near-optimal order. These instructions could affect the unroll factor only if they lowered the throughput of FP Add (not the case here: FP Load + FP Add + FP Mul can issue every cycle).
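
As an illustration only (this sketch is mine, not code from the answer), an unroll factor of 3 applied to the question's non-FMA loop could look like the following, assuming n is a multiple of 3:

__m256 areg0 = _mm256_set1_ps(a[m]);
__m256 sum0 = _mm256_setzero_ps();
__m256 sum1 = _mm256_setzero_ps();
__m256 sum2 = _mm256_setzero_ps();
for(int i=0; i<n; i+=3) {
    // Three independent add chains: each vaddps result is needed again only
    // three issue slots later, which matches the 3-cycle add latency.
    sum0 = _mm256_add_ps(_mm256_mul_ps(areg0, _mm256_load_ps(&b[8*(i+0)])), sum0);
    sum1 = _mm256_add_ps(_mm256_mul_ps(areg0, _mm256_load_ps(&b[8*(i+1)])), sum1);
    sum2 = _mm256_add_ps(_mm256_mul_ps(areg0, _mm256_load_ps(&b[8*(i+2)])), sum2);
}
// Fold the three partial sums once at the end.
__m256 tmp0 = _mm256_add_ps(_mm256_add_ps(sum0, sum1), sum2);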

For Haswell you need to unroll by 10:

  • Only FMA has a dependency on the previous iteration of the loop
  • FMA can double-issue every cycle (i.e. on average independent instructions take 0.5 cycles)
  • FMA has a latency of 5
  • Thus unrolling by 5/0.5 = 10 completely hides FMA latency (a sketch follows this list)
  • The two FP Load micro-operations do not have a dependency on the previous iteration and can co-issue with the two FMAs, so they don't affect the unroll factor.
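
And a corresponding sketch for Haswell (again mine, not code from the answer): ten independent accumulators so two FMAs can issue per cycle without stalling on the 5-cycle latency. It assumes n is a multiple of 10 and needs more than eight YMM registers, hence 64-bit mode:

__m256 areg0 = _mm256_set1_ps(a[m]);
__m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
__m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
__m256 acc4 = _mm256_setzero_ps(), acc5 = _mm256_setzero_ps();
__m256 acc6 = _mm256_setzero_ps(), acc7 = _mm256_setzero_ps();
__m256 acc8 = _mm256_setzero_ps(), acc9 = _mm256_setzero_ps();
for(int i=0; i<n; i+=10) {
    // With 10 chains and 2 FMAs issued per cycle, each chain is reused only
    // every 5 cycles, exactly covering the 5-cycle FMA latency.
    acc0 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+0)]), acc0);
    acc1 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+1)]), acc1);
    acc2 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+2)]), acc2);
    acc3 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+3)]), acc3);
    acc4 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+4)]), acc4);
    acc5 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+5)]), acc5);
    acc6 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+6)]), acc6);
    acc7 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+7)]), acc7);
    acc8 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+8)]), acc8);
    acc9 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+9)]), acc9);
}
// Fold the ten accumulators once at the end.
__m256 tmp0 = _mm256_add_ps(
    _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3)),
    _mm256_add_ps(_mm256_add_ps(_mm256_add_ps(acc4, acc5), _mm256_add_ps(acc6, acc7)),
                  _mm256_add_ps(acc8, acc9)));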
