Cholesky decomposition with OpenMP


Problem description

I have a project where we solve the inverse of large (over 3000x3000) positive definite dense matrices using Cholesky decomposition. The project is in Java and we are using the CERN Colt BLAS library. Profiling the code shows that the Cholesky decomposition is the bottleneck.

I decided to try to parallelize the Cholesky decomposition using OpenMP and use it as a DLL in Java (with JNA). I started with the Cholesky decomposition code in C from Rosetta Code.

What I noticed is that the values in a column, except for the diagonal element, are independent of each other. So I decided to calculate the diagonal element serially and the rest of the column in parallel. I also swapped the order of the loops so that the inner loop runs over the rows and the outer loop over the columns. The serial version is slightly slower than the one from Rosetta Code, but the parallel version is six times faster than the Rosetta Code version on my 4 core (8 HT) system. Using the DLL from Java speeds up our results by a factor of six as well. Here is the code:

double *cholesky(double *A, int n) {
    // Returns the lower-triangular factor L with A = L*L^T (row-major, n x n).
    double *L = (double*)calloc(n * n, sizeof(double));
    if (L == NULL)
        exit(EXIT_FAILURE);

    for (int j = 0; j < n; j++) {
        // Diagonal element of column j (computed serially).
        double s = 0;
        for (int k = 0; k < j; k++) {
            s += L[j * n + k] * L[j * n + k];
        }
        L[j * n + j] = sqrt(A[j * n + j] - s);
        // The remaining elements of column j are independent of each other,
        // so they can be computed in parallel.
        #pragma omp parallel for
        for (int i = j + 1; i < n; i++) {
            double s = 0;
            for (int k = 0; k < j; k++) {
                s += L[i * n + k] * L[j * n + k];
            }
            L[i * n + j] = (1.0 / L[j * n + j] * (A[i * n + j] - s));
        }
    }
    return L;
}

You can find the full code I used for testing at http://coliru.stacked-crooked.com/a/6f5750c20d456da9

I initially thought that false sharing would be a problem when the remaining elements of a column were small compared to the number of threads, but that does not seem to be the case. I tried

#pragma omp parallel for schedule(static, 8) // a cache line is 8 doubles

I have not found clear examples of how to parallelize Cholesky decomposition, and I don't know if what I have done is ideal. For example, will it work well on a NUMA system?

Perhaps a task-based approach is better in general? In slides 7-9 of http://courses.engr.illinois.edu/cs554/fa2013/notes/07_cholesky.pdf there is an example of parallel Cholesky decomposition using "fine grained tasks". It's not clear to me how to implement this yet.
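
My rough guess at what such a fine-grained task formulation might look like, using OpenMP 4.0 task dependences, is sketched below. The tile kernels potrf_tile, trsm_tile, syrk_tile, and gemm_tile, and the tile handles A[i][j] (e.g. a 2D array of pointers to bs x bs tiles), are placeholders for illustration only; I have not tested this:

#pragma omp parallel
#pragma omp single
for (int k = 0; k < nb; k++) {
    #pragma omp task depend(inout: A[k][k])
    potrf_tile(A[k][k], bs);                          // factor the diagonal tile
    for (int i = k + 1; i < nb; i++) {
        #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
        trsm_tile(A[k][k], A[i][k], bs);              // triangular solve of a panel tile
    }
    for (int i = k + 1; i < nb; i++) {
        #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
        syrk_tile(A[i][k], A[i][i], bs);              // rank-bs update of a diagonal tile
        for (int j = k + 1; j < i; j++) {
            #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
            gemm_tile(A[i][k], A[j][k], A[i][j], bs); // update an off-diagonal tile
        }
    }
}

The idea is that the runtime schedules a tile as soon as its inputs are ready, instead of synchronizing all threads at every column.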

I have two questions, one specific and one general. Do you have any suggestions on how to improve my implementation of Cholesky decomposition with OpenMP? Can you suggest a different implementation of Cholesky decomposition with OpenMP, e.g. with tasks?

As requested, here is the AVX function I used to compute s. It did not help:

// Dot product of li[0..n-1] and lj[0..n-1] using AVX, with a scalar loop for the remainder.
double inner_sum_AVX(double *li, double *lj, int n) {
    __m256d s4;
    int i;
    double s;

    s4 = _mm256_set1_pd(0.0);
    for (i = 0; i < (n & (-4)); i+=4) {
        __m256d li4, lj4;
        li4 = _mm256_loadu_pd(&li[i]);
        lj4 = _mm256_loadu_pd(&lj[i]);
        s4 = _mm256_add_pd(_mm256_mul_pd(li4, lj4), s4);
    }
    double out[4];
    _mm256_storeu_pd(out, s4);
    s = out[0] + out[1] + out[2] + out[3];
    for(;i<n; i++) {
        s += li[i]*lj[i];
    }
    return s;
}

Recommended answer

I managed to get SIMD working with the Cholesky decomposition. I did this using loop tiling, as I have used before in matrix multiplication. The solution was not trivial. Here are the times for a 5790x5790 matrix on my 4 core / 8 HT Ivy Bridge system (eff = GFLOPS/(peak GFLOPS)):

double floating point peak GFLOPS 118.1
1 thread       time 36.32 s, GFLOPS  1.78, eff  1.5%
8 threads      time  7.99 s, GFLOPS  8.10, eff  6.9%
4 threads+AVX  time  1.36 s, GFLOPS 47.64, eff 40.3%
4 threads MKL  time  0.68 s, GFLOPS 95.14, eff 80.6% // from LAPACKE_dpotrf

single floating point peak GFLOPS 236.2
1 thread       time 33.88 s, GFLOPS  1.91, eff  0.8%
8 threads      time  4.74 s, GFLOPS 13.64, eff  5.8%
4 threads+AVX  time  0.78 s, GFLOPS 82.61, eff 35.0%
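
(The GFLOPS values appear to use the standard Cholesky flop count of roughly n^3/3: e.g. (5790^3 / 3) / 36.32 s ≈ 1.78 GFLOPS for the 1-thread double run.)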

The new method is 25 times faster for double and 40 times faster for single precision. The efficiency is now about 35-40% of peak FLOPS. With matrix multiplication I get up to 70% with AVX in my own code. I don't know what to expect from Cholesky decomposition. Unlike matrix multiplication, the algorithm is partially serial (when calculating the diagonal block, called triangle in my code below).

Update: I'm within a factor of 2 of MKL. I don't know if I should be proud of that or embarrassed, but apparently my code can still be improved significantly. I found a PhD thesis on this which shows that my block algorithm is a common solution, so I managed to reinvent the wheel.
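
For reference, the MKL row in the table comes from LAPACKE_dpotrf; a minimal call, assuming row-major storage and the lower-triangular factor, would look something like this:

#include <mkl_lapacke.h>   // or <lapacke.h> with a reference LAPACK build

// A is the n x n positive definite matrix stored row-major; on success its
// lower triangle is overwritten with the Cholesky factor L.
lapack_int info = LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', n, A, n);
if (info > 0) {
    // the leading minor of order info is not positive definite
}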

I use 32x32 tiles for double and 64x64 tiles for float. I also reorder the memory so that each tile is contiguous and stored as its transpose. I defined a new matrix product function. Matrix multiplication is defined as:

C_i,j = A_i,k * B_k,j //sum over k

I realized that in the Cholesky algorithm there is something very similar:

C_j,i = A_i,k * B_j,k //sum over k

By writing the transpose of the tiles I was able to use my optimized matrix multiplication function here almost exactly (I only had to change one line of code). Here is the main function:

// B holds the matrix in tiled layout: tile (i,j) starts at B[stride*(nb*i+j)],
// where nb = n2/bs is the number of tiles per dimension and stride = bs*bs;
// each tile is stored transposed. tmp is the original n2 x n2 matrix.
reorder(tmp,B,n2,bs);
for(int j=0; j<nb; j++) {
    // Update block column j with the already factored block columns k < j.
    #pragma omp parallel for schedule(static) num_threads(ncores)
    for(int i=j; i<nb; i++) {
        for(int k=0; k<j; k++) {
            product(&B[stride*(nb*j+k)],&B[stride*(nb*i+k)],&B[stride*(nb*i+j)],bs);
        }
    }
    // Factor the diagonal tile (the partially serial step) ...
    triangle(&B[stride*(nb*j+j)], bs);
    // ... then solve the tiles below it against the diagonal tile.
    #pragma omp parallel for schedule(static)
    for(int i=j+1; i<nb; i++) {
        block(&B[stride*(nb*i+j)],&B[stride*(nb*j+j)],bs);
    }
}
reorder_inverse(B,tmp,n2,bs);
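
The snippet above assumes the tiling variables are already set up. A plausible setup, consistent with the reorder functions below (the exact values in my full code may differ), is:

int bs = 32;                                             // tile size for double (64 for float)
int nb = n2 / bs;                                        // tiles per dimension (assumes n2 % bs == 0)
int stride = bs * bs;                                    // elements per tile
int ncores = 4;                                          // physical cores (the "4 threads" rows above)
double *B = (double*)malloc(sizeof(double) * n2 * n2);   // tiled working copy; tmp is the original matrix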

Here are the other functions. I have six product functions for SSE2, AVX, and FMA, each with double and float versions. I only show the one for AVX and double here:

template <typename Type>
void triangle(Type *A, int n) {
    for (int j = 0; j < n; j++) {
        Type s = 0;
        for(int k=0; k<j; k++) s+= A[k*n+j]*A[k*n+j];
        //if((A[j * n + j] - s) < 0) printf("asdf3 j %d, %f %f\n", j, A[j * n + j] - s, sqrt(A[j * n + j] - s));
        A[j*n+j] = sqrt(A[j*n+j] - s);
        Type fact = 1.0/A[j*n+j];
        for (int i = j+1; i<n; i++) {
            Type s = 0;
            for(int k=0; k<j; k++) s+=A[k*n+i]*A[k*n+j];
            A[j*n+i] = fact * (A[j*n+i] - s);
        }
    }
}

template <typename Type>
void block(Type *A, Type *B, int n) {   
    for (int j = 0; j <n; j++) {
        Type fact = 1.0/B[j*n+j];   
        for (int i = 0; i<n; i++) {
            Type s = 0;
            for(int k=0; k<j; k++) {
                s += A[k*n+i]*B[k*n+j];
            }
            A[j*n+i] = fact * (A[j*n+i] - s);
        }
    }
}

template <typename Type>
void reorder(Type *A, Type *B, int n, int bs) {
    int nb = n/bs;
    int stride = bs*bs;
    //printf("%d %d %d\n", bs, nb, stride);
    #pragma omp parallel for schedule(static)
    for(int i=0; i<nb; i++) {
        for(int j=0; j<nb; j++) {
            for(int i2=0; i2<bs; i2++) {
                for(int j2=0; j2<bs; j2++) {
                    B[stride*(nb*i+j) + bs*j2+i2] = A[n*bs*i + j*bs + n*i2 + j2];
                }
            }
        }
    }
}

template <typename Type>
void reorder_inverse(Type *A, Type *B, int n, int bs) {
    int nb = n/bs;
    int stride = bs*bs;
    //printf("%d %d %d\n", bs, nb, stride);
    #pragma omp parallel for schedule(static)
    for(int i=0; i<nb; i++) {
        for(int j=0; j<nb; j++) {
            for(int i2=0; i2<bs; i2++) {
                for(int j2=0; j2<bs; j2++) {
                    B[n*bs*i + j*bs + n*i2 + j2] = A[stride*(nb*i+j) + bs*j2+i2];
                }
            }
        }
    }
}

extern "C" void product32x32_avx(double *a, double *b, double *c, int n) 
{
    for(int i=0; i<n; i++) {    
        __m256d t1 = _mm256_loadu_pd(&c[i*n +  0]);
        __m256d t2 = _mm256_loadu_pd(&c[i*n +  4]);
        __m256d t3 = _mm256_loadu_pd(&c[i*n +  8]);
        __m256d t4 = _mm256_loadu_pd(&c[i*n + 12]);
        __m256d t5 = _mm256_loadu_pd(&c[i*n + 16]);
        __m256d t6 = _mm256_loadu_pd(&c[i*n + 20]);
        __m256d t7 = _mm256_loadu_pd(&c[i*n + 24]);
        __m256d t8 = _mm256_loadu_pd(&c[i*n + 28]);
        for(int k=0; k<n; k++) {
            __m256d a1 = _mm256_set1_pd(a[k*n+i]);

            __m256d b1 = _mm256_loadu_pd(&b[k*n+0]);
            t1 = _mm256_sub_pd(t1,_mm256_mul_pd(a1,b1));

            __m256d b2 = _mm256_loadu_pd(&b[k*n+4]);
            t2 = _mm256_sub_pd(t2,_mm256_mul_pd(a1,b2));

            __m256d b3 = _mm256_loadu_pd(&b[k*n+8]);
            t3 = _mm256_sub_pd(t3,_mm256_mul_pd(a1,b3));

            __m256d b4 = _mm256_loadu_pd(&b[k*n+12]);
            t4 = _mm256_sub_pd(t4,_mm256_mul_pd(a1,b4));

            __m256d b5 = _mm256_loadu_pd(&b[k*n+16]);
            t5 = _mm256_sub_pd(t5,_mm256_mul_pd(a1,b5));

            __m256d b6 = _mm256_loadu_pd(&b[k*n+20]);
            t6 = _mm256_sub_pd(t6,_mm256_mul_pd(a1,b6));

            __m256d b7 = _mm256_loadu_pd(&b[k*n+24]);
            t7 = _mm256_sub_pd(t7,_mm256_mul_pd(a1,b7));

            __m256d b8 = _mm256_loadu_pd(&b[k*n+28]);
            t8 = _mm256_sub_pd(t8,_mm256_mul_pd(a1,b8));
        }
        _mm256_storeu_pd(&c[i*n +  0], t1);
        _mm256_storeu_pd(&c[i*n +  4], t2);
        _mm256_storeu_pd(&c[i*n +  8], t3);
        _mm256_storeu_pd(&c[i*n + 12], t4);
        _mm256_storeu_pd(&c[i*n + 16], t5);
        _mm256_storeu_pd(&c[i*n + 20], t6);
        _mm256_storeu_pd(&c[i*n + 24], t7);
        _mm256_storeu_pd(&c[i*n + 28], t8);
    }
}

