How Do I Attain Peak CPU Performance With Dot Product?



Question

I have been studying HPC, specifically using matrix multiplication as my project (see my other posts in my profile). I achieve good performance in those, but not good enough. I am taking a step back to see how well I can do with a dot product calculation.

Dot Product vs. Matrix Multiplication

The dot product is simpler, and it lets me test HPC concepts without dealing with packing and other related issues. Cache blocking is still an issue, which forms my second question.

Algorithm

Multiply n corresponding elements in two double arrays A and B and sum them. A double-precision dot product in assembly is just a series of movapd, mulpd, addpd. Unrolled and arranged in a clever way, it is possible to have groups of movapd/mulpd/addpd that operate on different xmm registers and are thus independent, optimizing pipelining. Of course, it turns out that this does not matter so much, as my CPU has out-of-order execution. Also note that the rearrangement requires peeling off the last iteration.

Other Assumptions

I am not writing the code for general dot products. The code is for specific sizes and I am not handling fringe cases. This is just to test HPC concepts and to see what type of CPU usage I can attain.

Results

Compiled with gcc -std=c99 -O2 -m32 -mincoming-stack-boundary=2 -msse3 -mfpmath=sse,387 -masm=intel. I am on a different computer than usual. This computer has a Core i5-540M, which can reach 2.8 GHz * 4 flops/cycle/core = 11.2 GFLOPS per core after a two-step Intel Turbo Boost (the 4 flops/cycle being one 2-wide mulpd plus one 2-wide addpd issued per cycle). Both cores are on right now, so it only gets a 2-step boost; a 4-step boost is possible if I turn off one core. 32-bit LINPACK gets around 9.5 GFLOPS when set to run with one thread.

       N   Total Gflops/s         Residual
     256         5.580521    1.421085e-014
     384         5.734344   -2.842171e-014
     512         5.791168    0.000000e+000
     640         5.821629    0.000000e+000
     768         5.814255    2.842171e-014
     896         5.807132    0.000000e+000
    1024         5.817208   -1.421085e-013
    1152         5.805388    0.000000e+000
    1280         5.830746   -5.684342e-014
    1408         5.881937   -5.684342e-014
    1536         5.872159   -1.705303e-013
    1664         5.881536    5.684342e-014
    1792         5.906261   -2.842171e-013
    1920         5.477966    2.273737e-013
    2048         5.620931    0.000000e+000
    2176         3.998713    1.136868e-013
    2304         3.370095   -3.410605e-013
    2432         3.371386   -3.410605e-013

Question 1

How can I do better than this? I am not even coming close to peak performance. I have optimized the assembly code to high heaven. Further unrolling might boost it a little more, but less unrolling seems to degrade performance.

Question 2

When n > 2048, you can see a drop in performance. This is because my L1 cache is 32 KB, and when n = 2048 and A and B are double, together they take up the entire cache (2 arrays * 2048 elements * 8 bytes = 32 KB). Any bigger and they are streamed from memory.

I tried cache blocking (not shown in source), but maybe I did it wrong. Can anyone provide some code or explain how to block a dot product for a cache?
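
For reference, "blocking" a dot product amounts to strip-mining it into L1-sized chunks, roughly as in the sketch below (ddot_blocked is an illustrative name, and the plain inner loop stands in for a tuned kernel such as ddot_asm). Note that each element of A and B is read exactly once, so there is no reuse for blocking to exploit in a single dot product; the sketch only shows the loop structure:

// Strip-mined dot product: walk the arrays in chunks small enough that a
// chunk of A plus a chunk of B fit in the 32 KB L1 together
// (2 * 1024 * 8 bytes = 16 KB, half the cache).
double ddot_blocked(int n, const double* restrict A, const double* restrict B)
{
    const int BLOCK = 1024;
    double sum = 0.0;
    for (int j = 0; j < n; j += BLOCK) {
        int len = (n - j < BLOCK) ? (n - j) : BLOCK;
        for (int i = 0; i < len; ++i)   // replace with a tuned kernel call
            sum += A[j + i] * B[j + i];
    }
    return sum;
}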

Source Code

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>
#include <math.h>
#include <omp.h>
#include <stdint.h>
#include <windows.h>

// Computes 8 two-wide products per invocation: 4 movapd loads of A, 4 mulpd
// against B, 4 addpd into the accumulators xmm0-xmm3.  The groups are
// software-pipelined: each addpd consumes a product formed several
// instructions earlier (the first one in the previous KERNEL), and xmm7 is
// loaded here but multiplied in the next KERNEL, so the movapd/mulpd/addpd
// within a group touch different registers and can issue in parallel.
#define KERNEL(address) \
            "movapd     xmm4, XMMWORD PTR [eax+"#address"]      \n\t" \
            "mulpd      xmm7, XMMWORD PTR [edx+48+"#address"]   \n\t" \
            "addpd      xmm2, xmm6                              \n\t" \
            "movapd     xmm5, XMMWORD PTR [eax+16+"#address"]   \n\t" \
            "mulpd      xmm4, XMMWORD PTR [edx+"#address"]      \n\t" \
            "addpd      xmm3, xmm7                              \n\t" \
            "movapd     xmm6, XMMWORD PTR [eax+96+"#address"]   \n\t" \
            "mulpd      xmm5, XMMWORD PTR [edx+16+"#address"]   \n\t" \
            "addpd      xmm0, xmm4                              \n\t" \
            "movapd     xmm7, XMMWORD PTR [eax+112+"#address"]  \n\t" \
            "mulpd      xmm6, XMMWORD PTR [edx+96+"#address"]   \n\t" \
            "addpd      xmm1, xmm5                              \n\t"

// Peeled final iteration: like KERNEL, but it issues no loads for a
// following group, draining the software pipeline.
#define PEELED(address) \
            "movapd     xmm4, XMMWORD PTR [eax+"#address"]      \n\t" \
            "mulpd      xmm7, [edx+48+"#address"]               \n\t" \
            "addpd      xmm2, xmm6                  \n\t" \
            "movapd     xmm5, XMMWORD PTR [eax+16+"#address"]   \n\t" \
            "mulpd      xmm4, XMMWORD PTR [edx+"#address"]      \n\t" \
            "addpd      xmm3, xmm7                  \n\t" \
            "mulpd      xmm5, XMMWORD PTR [edx+16+"#address"]   \n\t" \
            "addpd      xmm0, xmm4                  \n\t" \
            "addpd      xmm1, xmm5                  \n\t"

inline double 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64))) ddot_ref(
    int n,
    const double* restrict A,
    const double* restrict B)
{
    double sum0 = 0.0;
    double sum1 = 0.0;
    double sum2 = 0.0;
    double sum3 = 0.0;
    double sum;
    for(int i = 0; i < n; i+=4) {
        sum0 += *(A + i  ) * *(B + i  );
        sum1 += *(A + i+1) * *(B + i+1);
        sum2 += *(A + i+2) * *(B + i+2);
        sum3 += *(A + i+3) * *(B + i+3);
    }
    sum = sum0+sum1+sum2+sum3;
    return(sum);
}

inline double 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64))) ddot_asm
(   int n,
    const double* restrict A,
    const double* restrict B)
{

        double sum;

            __asm__ __volatile__
        (
            "mov        eax, %[A]                   \n\t"
            "mov        edx, %[B]                   \n\t"
            "mov        ecx, %[n]                   \n\t"
            "pxor       xmm0, xmm0                  \n\t"
            "pxor       xmm1, xmm1                  \n\t"
            "pxor       xmm2, xmm2                  \n\t"
            "pxor       xmm3, xmm3                  \n\t"
            "movapd     xmm6, XMMWORD PTR [eax+32]  \n\t"
            "movapd     xmm7, XMMWORD PTR [eax+48]  \n\t"
            "mulpd      xmm6, XMMWORD PTR [edx+32]  \n\t"
            "sar        ecx, 7                      \n\t"
            "sub        ecx, 1                      \n\t" // peel
            "L%=:                                   \n\t"
            KERNEL(64   *   0)
            KERNEL(64   *   1)
            KERNEL(64   *   2)
            KERNEL(64   *   3)
            KERNEL(64   *   4)
            KERNEL(64   *   5)
            KERNEL(64   *   6)
            KERNEL(64   *   7)
            KERNEL(64   *   8)
            KERNEL(64   *   9)
            KERNEL(64   *   10)
            KERNEL(64   *   11)
            KERNEL(64   *   12)
            KERNEL(64   *   13)
            KERNEL(64   *   14)
            KERNEL(64   *   15)
            "lea        eax, [eax+1024]             \n\t"
            "lea        edx, [edx+1024]             \n\t"
            "                                       \n\t"
            "dec        ecx                         \n\t"
            "jnz        L%=                         \n\t" // end loop
            "                                       \n\t"
            KERNEL(64   *   0)
            KERNEL(64   *   1)
            KERNEL(64   *   2)
            KERNEL(64   *   3)
            KERNEL(64   *   4)
            KERNEL(64   *   5)
            KERNEL(64   *   6)
            KERNEL(64   *   7)
            KERNEL(64   *   8)
            KERNEL(64   *   9)
            KERNEL(64   *   10)
            KERNEL(64   *   11)
            KERNEL(64   *   12)
            KERNEL(64   *   13)
            KERNEL(64   *   14)
            PEELED(64   *   15)
            "                                       \n\t"
            "addpd      xmm0, xmm1                  \n\t" // summing result
            "addpd      xmm2, xmm3                  \n\t"
            "addpd      xmm0, xmm2                  \n\t" // cascading add
            "movapd     xmm1, xmm0                  \n\t" // copy xmm0
            "shufpd     xmm1, xmm0, 0x03            \n\t" // shuffle
            "addsd      xmm0, xmm1                  \n\t" // add low qword
            "movsd      %[sum], xmm0                \n\t" // mov low qw to sum
            : // outputs
            [sum]   "=m"    (sum)
            : // inputs
            [A] "m" (A),
            [B] "m" (B), 
            [n] "m" (n)
            : //register clobber
            "memory",
            "eax","ecx","edx","edi",
            "xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7"
            );
        return(sum);
}

int main()
{
    // timers
    LARGE_INTEGER frequency, time1, time2;
    double time3;
    QueryPerformanceFrequency(&frequency);
    // clock_t time1, time2;
    double gflops;

    int nmax = 4096;
    int trials = 1e7;
    double sum, residual;
    FILE *f = fopen("soddot.txt","w+");

    printf("%16s %16s %16s\n","N","Total Gflops/s","Residual");
    fprintf(f,"%16s %16s %16s\n","N","Total Gflops/s","Residual");

    for(int n = 256; n <= nmax; n += 128 ) {
        double* A = NULL;
        double* B = NULL;
        A = _mm_malloc(n*sizeof(*A), 64); if (!A) {printf("A failed\n"); return(1);}
        B = _mm_malloc(n*sizeof(*B), 64); if (!B) {printf("B failed\n"); return(1);}

        srand(time(NULL));

        // create arrays
        for(int i = 0; i < n; ++i) {
            *(A + i) = (double) rand()/RAND_MAX;
            *(B + i) = (double) rand()/RAND_MAX;
        }

        // warmup
        sum = ddot_asm(n,A,B);

        QueryPerformanceCounter(&time1);
        // time1 = clock();
        for (int count = 0; count < trials; count++){
            // sum = ddot_ref(n,A,B);
            sum = ddot_asm(n,A,B);
        }
        QueryPerformanceCounter(&time2);
        time3 = (double)(time2.QuadPart - time1.QuadPart) / frequency.QuadPart;
        // time3 = (double) (clock() - time1)/CLOCKS_PER_SEC;
        gflops = (double) (2.0*n*trials)/time3/1.0e9;
        residual = ddot_ref(n,A,B) - sum;
        printf("%16d %16f %16e\n",n,gflops,residual);
        fprintf(f,"%16d %16f %16e\n",n,gflops,residual);

        _mm_free(A);
        _mm_free(B);
    }
    fclose(f);
    return(0); // successful completion
}

Edit: Explanation of the Assembly

A dot product is just a repeated sum of products of two numbers: sum += a[i]*b[i]. sum must be initialized to 0 before the first iteration. Vectorized, you can do 2 sums at a time, which must be summed at the end: [sum0 sum1] = [a[i] a[i+1]]*[b[i] b[i+1]], sum = sum0 + sum1. In (Intel) assembly, this is 3 steps (after the initialization):

pxor   xmm0, xmm0              // accumulator [sum0 sum1] = [0 0]
movapd xmm1, XMMWORD PTR [eax] // load [a[i] a[i+1]] into xmm1
mulpd  xmm1, XMMWORD PTR [edx] // xmm1 = xmm1 * [b[i] b[i+1]]
addpd  xmm0, xmm1              // xmm0 = xmm0 + xmm1

At this point you have nothing special; the compiler can come up with this. You can usually get better performance by unrolling the code enough times to use all the xmm registers available to you (8 registers in 32-bit mode). So if you unroll it 4 times, that allows you to utilize all 8 registers, xmm0 through xmm7. You will have 4 accumulators and 4 registers for storing the results of movapd and mulpd. Again, the compiler can come up with this. The real thinking part is trying to come up with a way to pipeline the code, i.e., make each instruction in the group of MOV/MUL/ADD operate on different registers so that all 3 instructions execute at the same time (usually the case on most CPUs). That's how you beat the compiler. So you have to pattern the 4x unrolled code to do just that, which may require loading vectors ahead of time and peeling off the first or last iteration. This is what KERNEL(address) is. I made a macro of the 4x unrolled, pipelined code for convenience. That way I can easily unroll it in multiples of 4 by just changing address. Each KERNEL computes 8 dot products.
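
In C with SSE2 intrinsics, the same multi-accumulator idea looks roughly like the following sketch (ddot_sse_sketch is illustrative, not the exact pipelined schedule of KERNEL above, which staggers loads across groups; it assumes n is a multiple of 8 and 16-byte-aligned inputs):

#include <emmintrin.h>

// Four independent accumulators: the addpd of one group never waits on the
// addpd of another, so the multiply/add chains can overlap in the pipeline.
double ddot_sse_sketch(int n, const double* A, const double* B)
{
    __m128d acc0 = _mm_setzero_pd(), acc1 = _mm_setzero_pd();
    __m128d acc2 = _mm_setzero_pd(), acc3 = _mm_setzero_pd();
    for (int i = 0; i < n; i += 8) {
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_load_pd(A+i  ), _mm_load_pd(B+i  )));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_load_pd(A+i+2), _mm_load_pd(B+i+2)));
        acc2 = _mm_add_pd(acc2, _mm_mul_pd(_mm_load_pd(A+i+4), _mm_load_pd(B+i+4)));
        acc3 = _mm_add_pd(acc3, _mm_mul_pd(_mm_load_pd(A+i+6), _mm_load_pd(B+i+6)));
    }
    // Reduce the four vector accumulators to a scalar.
    acc0 = _mm_add_pd(_mm_add_pd(acc0, acc1), _mm_add_pd(acc2, acc3));
    double lo_hi[2];
    _mm_storeu_pd(lo_hi, acc0);
    return lo_hi[0] + lo_hi[1];
}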

Answer

To answer your overall question: you can't achieve peak performance with the dot product.

The problem is that your CPU can do one 128-bit load per clock cycle, while the dot product needs two 128-bit loads per clock cycle.
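
To put numbers on that: each 2-wide multiply-accumulate (one mulpd plus one addpd, i.e. 4 flops) consumes one 128-bit load from A and one from B. At one 128-bit load per cycle, that is 4 flops every 2 cycles, or 2 flops/cycle, which is 50% of the 4 flops/cycle peak. That bound matches the plateau of roughly 5.8 out of 11.2 GFLOPS in your table.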

But it's worse than that for large n. The answer to your second question is that the dot product is memory bound, not compute bound, so it does not parallelize well for large n even on fast cores. This is explained better in why-vectorizing-the-loop-does-not-have-performance-improvement. This is a big problem with parallelization on fast cores. It took me a while to figure this out, but it's very important to learn.

There are actually few basic algorithms that can fully benefit from parallelization on fast cores. In terms of BLAS, only the Level-3 algorithms (O(n^3)), such as matrix multiplication, really benefit from parallelization. The situation is better on slow cores, e.g. GPUs and the Xeon Phi, because the discrepancy between memory speed and core speed is much smaller.
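
A quick way to see the difference is arithmetic intensity. A dot product does 2n flops while touching 2n*8 bytes, a fixed 1/8 flop per byte no matter how large n gets. An n x n matrix multiplication does 2n^3 flops on 3n^2*8 bytes, so its flops-per-byte ratio grows like n/12, and with cache blocking the memory bus eventually stops being the bottleneck.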

If you want an algorithm that can get close to peak flops for small n, try e.g. scalar * vector or the sum of scalar * vector. The first case should do one load, one multiply, and one store every clock cycle, and the second one multiply, one add, and one load every clock cycle.

I tested the following code on a Core 2 Duo P9600 @ 2.67 GHz in Knoppix 7.3 32-bit. I get about 75% of the peak for the scalar product and 75% of the peak for the sum of the scalar product. The flops/cycle for the scalar product is 2, and for the sum of the scalar product it is 4.
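
The port accounting behind those numbers: in the scalar product, one 2-wide iteration issues one load, one mulpd, and one store, so at one iteration per cycle it does 2 flops/cycle (hence peak = 2*2.67 GFLOPS in the code below). In the sum, the store is replaced by an addpd, so one load feeds one mulpd plus one addpd, for 4 flops/cycle (peak = 2*2*2.67 GFLOPS).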

Compiled with g++ -msse2 -O3 -fopenmp -ffast-math foo.cpp.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <x86intrin.h>

void scalar_product(double * __restrict a, int n) {
    a = (double*)__builtin_assume_aligned (a, 64);
    double k = 3.14159;
    for(int i=0; i<n; i++) {
        a[i] = k*a[i]; 
    }
}

void scalar_product_SSE(double * __restrict a, int n) {
    a = (double*)__builtin_assume_aligned (a, 64);
    __m128d k = _mm_set1_pd(3.14159);    
    for(int i=0; i<n; i+=8) {
        __m128d t1 = _mm_load_pd(&a[i+0]);
        _mm_store_pd(&a[i],_mm_mul_pd(k,t1));
        __m128d t2 = _mm_load_pd(&a[i+2]);
        _mm_store_pd(&a[i+2],_mm_mul_pd(k,t2));
        __m128d t3 = _mm_load_pd(&a[i+4]);
        _mm_store_pd(&a[i+4],_mm_mul_pd(k,t3));
        __m128d t4 = _mm_load_pd(&a[i+6]);
        _mm_store_pd(&a[i+6],_mm_mul_pd(k,t4));
    }
}

double scalar_sum(double * __restrict a, int n) {
    a = (double*)__builtin_assume_aligned (a, 64);
    double sum = 0.0;    
    double k = 3.14159;
    for(int i=0; i<n; i++) {
        sum += k*a[i]; 
    }
    return sum;
}

double scalar_sum_SSE(double * __restrict a, int n) {
    a = (double*)__builtin_assume_aligned (a, 64);
    __m128d sum1 = _mm_setzero_pd();
    __m128d sum2 = _mm_setzero_pd();
    __m128d sum3 = _mm_setzero_pd();
    __m128d sum4 = _mm_setzero_pd();
    __m128d k = _mm_set1_pd(3.14159);   
    for(int i=0; i<n; i+=8) {
        __m128d t1 = _mm_load_pd(&a[i+0]);
        sum1 = _mm_add_pd(_mm_mul_pd(k,t1),sum1);
        __m128d t2 = _mm_load_pd(&a[i+2]);
        sum2 = _mm_add_pd(_mm_mul_pd(k,t2),sum2);
        __m128d t3 = _mm_load_pd(&a[i+4]);
        sum3 = _mm_add_pd(_mm_mul_pd(k,t3),sum3);
        __m128d t4 = _mm_load_pd(&a[i+6]);
        sum4 = _mm_add_pd(_mm_mul_pd(k,t4),sum4);      
    }
    double tmp[8];
    _mm_storeu_pd(&tmp[0],sum1);
    _mm_storeu_pd(&tmp[2],sum2);
    _mm_storeu_pd(&tmp[4],sum3);
    _mm_storeu_pd(&tmp[6],sum4);
    double sum = 0;
    for(int i=0; i<8; i++) sum+=tmp[i];
    return sum;
}

int main() {
    //_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    //_mm_setcsr(_mm_getcsr() | 0x8040);
    double dtime, peak, flops, sum;
    int repeat = 1<<18;
    const int n = 2048;
    double *a = (double*)_mm_malloc(sizeof(double)*n,64);
    double *b = (double*)_mm_malloc(sizeof(double)*n,64);
    for(int i=0; i<n; i++) a[i] = 1.0*rand()/RAND_MAX;

    dtime = omp_get_wtime();
    for(int r=0; r<repeat; r++) {
        scalar_product_SSE(a,n);
    }
    dtime = omp_get_wtime() - dtime;
    peak = 2*2.67;
    flops = 1.0*n/dtime*1E-9*repeat;
    printf("time %f, %f, %f\n", dtime,flops, flops/peak);

    //for(int i=0; i<n; i++) a[i] = 1.0*rand()/RAND_MAX;
    //sum = 0.0;    
    dtime = omp_get_wtime();
    for(int r=0; r<repeat; r++) {
        scalar_sum_SSE(a,n);
    }
    dtime = omp_get_wtime() - dtime;
    peak = 2*2*2.67;
    flops = 2.0*n/dtime*1E-9*repeat;
    printf("time %f, %f, %f\n", dtime,flops, flops/peak);
    //printf("sum %f\n", sum);

    _mm_free(a);
    _mm_free(b);
    return 0;
}
