Cholesky decomposition with OpenMP


Problem description


I have a project where we solve the inverse of large (over 3000x3000) positive definite dense matrices using Cholesky decomposition. The project is in Java and we are using the CERN Colt BLAS library. Profiling the code shows that the Cholesky decomposition is the bottleneck.

I decided to try and parallelize the Cholesky decomposition using OpenMP and use it as a DLL in Java (with JNA). I started with the Cholesky decomposition code in C from Rosetta Code.

What I noticed is that the values in a column, except for the diagonal element, are independent. So I decided to calculate the diagonal element serially and the rest of the values of the column in parallel. I also swapped the order of the loops so that the inner loop runs over the rows and the outer loop over the columns. The serial version is slightly slower than the one from Rosetta Code, but the parallel version is six times faster than the Rosetta Code version on my 4 core (8 HT) system. Using the DLL in Java speeds up our results by six times as well. Here is the code:

double *cholesky(double *A, int n) {
    double *L = (double*)calloc(n * n, sizeof(double));
    if (L == NULL)
        exit(EXIT_FAILURE);

    for (int j = 0; j < n; j++) {
        // The diagonal element must be computed before the rest of column j.
        double s = 0;
        for (int k = 0; k < j; k++) {
            s += L[j * n + k] * L[j * n + k];
        }
        L[j * n + j] = sqrt(A[j * n + j] - s);
        // The elements below the diagonal are independent of each other,
        // so the rest of the column can be filled in parallel.
        #pragma omp parallel for
        for (int i = j + 1; i < n; i++) {
            double s = 0;
            for (int k = 0; k < j; k++) {
                s += L[i * n + k] * L[j * n + k];
            }
            L[i * n + j] = (1.0 / L[j * n + j] * (A[i * n + j] - s));
        }
    }
    return L;
}
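
A quick sanity check for the routine (my example here, not part of the project code):

// 2x2 SPD example: expect L = [2, 0; 1, sqrt(2)].
double A[4] = {4, 2, 2, 3};
double *L = cholesky(A, 2);
printf("%f %f %f\n", L[0], L[2], L[3]);  // 2.000000 1.000000 1.414214
free(L);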

You can find the full code for testing this at http://coliru.stacked-crooked.com/a/6f5750c20d456da9

I initially thought that false sharing would be a problem when the number of remaining elements in a column was small compared to the number of threads, but that does not seem to be the case. I tried

#pragma omp parallel for schedule(static, 8) // a cache line is 8 doubles

I have not found clear examples of how to parallelize Cholesky decomposition. I don't know if what I have done is ideal. For example, will it work well on a NUMA system?
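
One NUMA precaution I know of (an assumption on my part, not something I have tested here) is first-touch placement: pages are typically allocated on the node of the thread that first writes them, so initializing L with the same static schedule that later accesses it should spread the matrix across nodes:

// Replace calloc with malloc plus a parallel first-touch initialization.
double *L = (double*)malloc(sizeof(double) * n * n);
if (L == NULL) exit(EXIT_FAILURE);
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
        L[(size_t)i * n + k] = 0.0;   // each thread first-touches its own rows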

Perhaps a task-based approach is better in general? In slides 7-9 at http://courses.engr.illinois.edu/cs554/fa2013/notes/07_cholesky.pdf there is an example of parallel Cholesky decomposition using "fine grained tasks". It's not clear to me how to implement this yet; my best reading so far is sketched below.
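
This is how I would guess the tasked version looks with OpenMP 4.0 task dependences, on contiguously stored bs x bs tiles. It is only a sketch: potrf_tile, trsm_tile, and update_tile are placeholder kernel names I made up, not functions from this post.

// Tiled task-based Cholesky sketch: nb x nb tiles of size bs x bs,
// tile (i,j) stored contiguously at A + (i*nb + j)*bs*bs.
void cholesky_tasks(double *A, int nb, int bs) {
    int ts = bs * bs;                           // elements per tile
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nb; k++) {
        double *Akk = &A[(k * nb + k) * ts];
        #pragma omp task depend(inout: Akk[0])
        potrf_tile(Akk, bs);                    // factor the diagonal tile
        for (int i = k + 1; i < nb; i++) {
            double *Aik = &A[(i * nb + k) * ts];
            #pragma omp task depend(in: Akk[0]) depend(inout: Aik[0])
            trsm_tile(Aik, Akk, bs);            // solve one panel tile
        }
        for (int i = k + 1; i < nb; i++) {
            for (int j = k + 1; j <= i; j++) {
                double *Aij = &A[(i * nb + j) * ts];
                double *Aik = &A[(i * nb + k) * ts];
                double *Ajk = &A[(j * nb + k) * ts];
                #pragma omp task depend(in: Aik[0], Ajk[0]) depend(inout: Aij[0])
                update_tile(Aij, Aik, Ajk, bs); // A_ij -= A_ik * A_jk^T
            }
        }
    }   // implicit barrier of the parallel region waits for all tasks
}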

I have two questions, one specific and one general. Do you have any suggestions on how to improve my implementation of Cholesky decomposition with OpenMP? Can you suggest a different implementation of Cholesky decomposition with OpenMP, e.g. with tasks?

Edit: as requested, here is the AVX function I used to compute s. It did not help.

double inner_sum_AVX(double *li, double *lj, int n) {
    __m256d s4;
    int i;
    double s;

    s4 = _mm256_set1_pd(0.0);
    // Vector loop: four doubles per iteration; n & (-4) rounds n down to a multiple of 4.
    for (i = 0; i < (n & (-4)); i += 4) {
        __m256d li4, lj4;
        li4 = _mm256_loadu_pd(&li[i]);
        lj4 = _mm256_loadu_pd(&lj[i]);
        s4 = _mm256_add_pd(_mm256_mul_pd(li4, lj4), s4);
    }
    // Horizontal sum of the four partial sums.
    double out[4];
    _mm256_storeu_pd(out, s4);
    s = out[0] + out[1] + out[2] + out[3];
    // Scalar cleanup for the remaining n % 4 elements.
    for (; i < n; i++) {
        s += li[i] * lj[i];
    }
    return s;
}
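
For reference, this is roughly how it slots into the factorization loop (my description, not a line from the original post): the inner k-loop over row prefixes of L becomes a single call.

// Inside the parallel i-loop: dot product of the first j entries of rows i and j.
double s = inner_sum_AVX(&L[i * n], &L[j * n], j);
L[i * n + j] = (1.0 / L[j * n + j] * (A[i * n + j] - s));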

Solution

I managed to get SIMD working with the Cholesky decomposition. I did this using loop tiling, as I had used before for matrix multiplication. The solution was not trivial. Here are the times for a 5790x5790 matrix on my 4 core / 8 HT Ivy Bridge system (eff = GFLOPS/(peak GFLOPS)):

double floating point peak GFLOPS 118.1
1 thread       time 36.32 s, GFLOPS  1.78, eff  1.5%
8 threads      time  7.99 s, GFLOPS  8.10, eff  6.9%
4 threads+AVX  time  1.36 s, GFLOPS 47.64, eff 40.3%
4 threads MKL  time  0.68 s, GFLOPS 95.14, eff 80.6% // from LAPACKE_dpotrf

single floating point peak GFLOPS 236.2
1 thread       time 33.88 s, GFLOPS  1.91, eff  0.8%
8 threads      time  4.74 s, GFLOPS 13.64, eff  5.8%
4 threads+AVX  time  0.78 s, GFLOPS 82.61, eff 35.0%

The new method is 25 times faster for double and 40 times faster for single. The efficiency is now about 35-40% of peak FLOPS. With matrix multiplication I get up to 70% with AVX in my own code. I don't know what to expect from Cholesky decomposition. The algorithm is partially serial (when calculating the diagonal block, called triangle in my code below), unlike matrix multiplication.

Update: I'm within a factor of 2 of MKL. I don't know if I should be proud of that or embarrassed by it, but apparently my code can still be improved significantly. I found a PhD thesis on this which shows that my block algorithm is a common solution, so I managed to reinvent the wheel.

I use 32x32 tiles for double and 64x64 tiles for float. I also reorder the memory so that each tile is contiguous and stored as its transpose. I defined a new matrix product function. Matrix multiplication is defined as:

C_i,j = A_i,k * B_k,j //sum over k

I realized that in the Cholesky algorithm there is something very similar:

C_j,i = A_i,k * B_j,k //sum over k

By writing the transpose of the tiles I was able to use my optimized function for matrix multiplication from http://stackoverflow.com/questions/21134279/difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp/21257842#21257842 almost exactly (I only had to change one line of code). Here is the main function:

reorder(tmp,B,n2,bs);                       // switch to the tiled, transposed layout
for(int j=0; j<nb; j++) {
    // Update block-column j with all previously factored block-columns.
    #pragma omp parallel for schedule(static) num_threads(ncores)
    for(int i=j; i<nb; i++) {
        for(int k=0; k<j; k++) {
            product(&B[stride*(nb*j+k)],&B[stride*(nb*i+k)],&B[stride*(nb*i+j)],bs);
        }
    }
    triangle(&B[stride*(nb*j+j)], bs);      // factor the diagonal tile (serial part)
    // Solve the off-diagonal tiles of block-column j in parallel.
    #pragma omp parallel for schedule(static)
    for(int i=j+1; i<nb; i++) {
        block(&B[stride*(nb*i+j)],&B[stride*(nb*j+j)],bs);
    }
}
reorder_inverse(B,tmp,n2,bs);               // copy the result back to row-major
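
The surrounding glue isn't shown in the post; my assumption is something like the following, with n2 the dimension padded to a multiple of the tile size, tmp the padded row-major input, and B a scratch buffer of the same size:

int bs = 32;                            // tile size used for double (64 for float)
int n2 = ((n + bs - 1) / bs) * bs;      // pad the dimension to a multiple of bs
int nb = n2 / bs;                       // number of tiles per dimension
int stride = bs * bs;                   // elements per tile
int ncores = 4;                         // physical cores on the test system
double *tmp = (double*)calloc((size_t)n2 * n2, sizeof(double)); // holds the padded input
double *B   = (double*)malloc(sizeof(double) * (size_t)n2 * n2); // tiled scratch buffer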

Here are the other functions. I have six product functions: SSE2, AVX, and FMA, each with a double and a float version. I only show the AVX double version here:

template <typename Type>
void triangle(Type *A, int n) {
    // Serial Cholesky factorization of one diagonal tile. Tiles are stored
    // transposed, so A[k*n+j] here addresses element (j,k) of the tile.
    for (int j = 0; j < n; j++) {
        Type s = 0;
        for(int k=0; k<j; k++) s+= A[k*n+j]*A[k*n+j];
        //if((A[j * n + j] - s)<0) printf("asdf3 j %d, %f %f\n", j, A[j * n + j] - s, sqrt(A[j * n + j] - s));
        A[j*n+j] = sqrt(A[j*n+j] - s);
        Type fact = 1.0/A[j*n+j];
        for (int i = j+1; i<n; i++) {
            Type s = 0;
            for(int k=0; k<j; k++) s+=A[k*n+i]*A[k*n+j];
            A[j*n+i] = fact * (A[j*n+i] - s);
        }
    }
}

template <typename Type>
void block(Type *A, Type *B, int n) {
    // Forward substitution (a small TRSM): apply the inverse of the already
    // factored diagonal tile B to the off-diagonal tile A.
    for (int j = 0; j < n; j++) {
        Type fact = 1.0/B[j*n+j];   
        for (int i = 0; i<n; i++) {
            Type s = 0;
            for(int k=0; k<j; k++) {
                s += A[k*n+i]*B[k*n+j];
            }
            A[j*n+i] = fact * (A[j*n+i] - s);
        }
    }
}

template <typename Type>
void reorder(Type *A, Type *B, int n, int bs) {
    // Copy the row-major matrix A into B as contiguous bs x bs tiles,
    // each tile stored as its transpose.
    int nb = n/bs;
    int stride = bs*bs;
    //printf("%d %d %d\n", bs, nb, stride);
    #pragma omp parallel for schedule(static)
    for(int i=0; i<nb; i++) {
        for(int j=0; j<nb; j++) {
            for(int i2=0; i2<bs; i2++) {
                for(int j2=0; j2<bs; j2++) {
                    B[stride*(nb*i+j) + bs*j2+i2] = A[n*bs*i + j*bs + n*i2 + j2];
                }
            }
        }
    }
}
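
In index terms (my paraphrase of the loops above, with r = bs*i + i2 and c = bs*j + j2 the row and column in the original matrix; scatter_one is a hypothetical name, not a function in my code):

// Where element (r, c) of the row-major n x n matrix A ends up in the tiled buffer B.
template <typename Type>
void scatter_one(Type *B, const Type *A, int r, int c, int n, int bs) {
    int nb = n / bs;                          // tiles per dimension
    int tile   = (r / bs) * nb + (c / bs);    // which bs x bs tile
    int offset = (c % bs) * bs + (r % bs);    // transposed position inside the tile
    B[tile * bs * bs + offset] = A[r * n + c];
}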

template <typename Type>
void reorder_inverse(Type *A, Type *B, int n, int bs) {
    // Inverse of reorder: copy the tiled, transposed layout in A back into
    // the row-major matrix B.
    int nb = n/bs;
    int stride = bs*bs;
    //printf("%d %d %d\n", bs, nb, stride);
    #pragma omp parallel for schedule(static)
    for(int i=0; i<nb; i++) {
        for(int j=0; j<nb; j++) {
            for(int i2=0; i2<bs; i2++) {
                for(int j2=0; j2<bs; j2++) {
                    B[n*bs*i + j*bs + n*i2 + j2] = A[stride*(nb*i+j) + bs*j2+i2];
                }
            }
        }
    }
}

extern "C" void product32x32_avx(double *a, double *b, double *c, int n)
{
    // One tile update: for each row i, c[i][0..31] -= sum over k of
    // a[k*n+i] * b[k][0..31], i.e. C -= A^T * B in the stored layout.
    // t1..t8 keep a full 32-double row of c in registers across the k loop.
    for(int i=0; i<n; i++) {
        __m256d t1 = _mm256_loadu_pd(&c[i*n +  0]);
        __m256d t2 = _mm256_loadu_pd(&c[i*n +  4]);
        __m256d t3 = _mm256_loadu_pd(&c[i*n +  8]);
        __m256d t4 = _mm256_loadu_pd(&c[i*n + 12]);
        __m256d t5 = _mm256_loadu_pd(&c[i*n + 16]);
        __m256d t6 = _mm256_loadu_pd(&c[i*n + 20]);
        __m256d t7 = _mm256_loadu_pd(&c[i*n + 24]);
        __m256d t8 = _mm256_loadu_pd(&c[i*n + 28]);
        for(int k=0; k<n; k++) {
            __m256d a1 = _mm256_set1_pd(a[k*n+i]);

            __m256d b1 = _mm256_loadu_pd(&b[k*n+0]);
            t1 = _mm256_sub_pd(t1,_mm256_mul_pd(a1,b1));

            __m256d b2 = _mm256_loadu_pd(&b[k*n+4]);
            t2 = _mm256_sub_pd(t2,_mm256_mul_pd(a1,b2));

            __m256d b3 = _mm256_loadu_pd(&b[k*n+8]);
            t3 = _mm256_sub_pd(t3,_mm256_mul_pd(a1,b3));

            __m256d b4 = _mm256_loadu_pd(&b[k*n+12]);
            t4 = _mm256_sub_pd(t4,_mm256_mul_pd(a1,b4));

            __m256d b5 = _mm256_loadu_pd(&b[k*n+16]);
            t5 = _mm256_sub_pd(t5,_mm256_mul_pd(a1,b5));

            __m256d b6 = _mm256_loadu_pd(&b[k*n+20]);
            t6 = _mm256_sub_pd(t6,_mm256_mul_pd(a1,b6));

            __m256d b7 = _mm256_loadu_pd(&b[k*n+24]);
            t7 = _mm256_sub_pd(t7,_mm256_mul_pd(a1,b7));

            __m256d b8 = _mm256_loadu_pd(&b[k*n+28]);
            t8 = _mm256_sub_pd(t8,_mm256_mul_pd(a1,b8));
        }
        _mm256_storeu_pd(&c[i*n +  0], t1);
        _mm256_storeu_pd(&c[i*n +  4], t2);
        _mm256_storeu_pd(&c[i*n +  8], t3);
        _mm256_storeu_pd(&c[i*n + 12], t4);
        _mm256_storeu_pd(&c[i*n + 16], t5);
        _mm256_storeu_pd(&c[i*n + 20], t6);
        _mm256_storeu_pd(&c[i*n + 24], t7);
        _mm256_storeu_pd(&c[i*n + 28], t8);
    }
}
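
The FMA versions presumably differ only in the inner update, where each multiply-subtract pair collapses into a single fused instruction, e.g.:

t1 = _mm256_fnmadd_pd(a1, b1, t1);   // t1 -= a1 * b1 in one FMA instruction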
