Optimization of 3D Direct Convolution Implementation in C


Problem Description


For my project, I've written a naive C implementation of direct 3D convolution with periodic padding on the input. Unfortunately, since I'm new to C, the performance isn't so good... here's the code:

int mod(int a, int b)
{
    // calculate mod to get the correct index with periodic padding
    int r = a % b;
    return r < 0 ? r + b : r;
}
void convolve3D(const double *image, const double *kernel, const int imageDimX, const int imageDimY, const int imageDimZ, const int kernelDimX, const int kernelDimY, const int kernelDimZ, double *result)
{
    int imageSize = imageDimX * imageDimY * imageDimZ;
    int kernelSize = kernelDimX * kernelDimY * kernelDimZ;

    int i, j, k, l, m, n;
    int kernelCenterX = (kernelDimX - 1) / 2;
    int kernelCenterY = (kernelDimY - 1) / 2;
    int kernelCenterZ = (kernelDimZ - 1) / 2;
    int xShift,yShift,zShift;
    int outIndex, outI, outJ, outK;
    int imageIndex = 0, kernelIndex = 0;
    
    // Loop through each voxel
    for (k = 0; k < imageDimZ; k++){
        for ( j = 0; j < imageDimY; j++) {
            for ( i = 0; i < imageDimX; i++) {
                kernelIndex = 0;
                // for each voxel, loop through each kernel coefficient
                for (n = 0; n < kernelDimZ; n++){
                    for ( m = 0; m < kernelDimY; m++) {
                        for ( l = 0; l < kernelDimX; l++) {
                            // find the index of the corresponding voxel in the output image
                            xShift = l - kernelCenterX;
                            yShift = m - kernelCenterY;
                            zShift = n - kernelCenterZ;

                            outI = mod ((i - xShift), imageDimX);
                            outJ = mod ((j - yShift), imageDimY);
                            outK = mod ((k - zShift), imageDimZ);
                            
                            outIndex = outK * imageDimX * imageDimY + outJ * imageDimX + outI;

                            // calculate and add
                            result[outIndex] += kernel[kernelIndex] * image[imageIndex];
                            kernelIndex++;
                        }
                    }
                } 
                imageIndex ++;
            }
        }
    } 
}

  • By convention, all the matrices (image, kernel, result) are stored in column-major fashion, which is why I loop through them in this order, so that consecutive accesses are close together in memory (I've heard that helps).

I know the implementation is very naive, but since it's written in C I was hoping the performance would be good; instead it's a little disappointing. I tested it with an image of size 100^3 and a kernel of size 10^3 (roughly 10^9 multiply-add pairs if you count only the multiplications and additions), and it took ~7 s, which I believe is way below the capability of a typical CPU.
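Spelling that estimate out: 100^3 output voxels × 10^3 kernel taps = 10^9 multiply-add pairs, i.e. about 2 × 10^9 floating-point operations; at ~7 s that works out to roughly 0.3 GFLOP/s, far below what even a single modern core can sustain.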

If possible, could you guys help me optimize this routine? I'm open to anything that could help; there are just a few things I'd ask you to keep in mind:

  1. The problem I'm working with could be big (e.g. an image of size 200 x 200 x 200 with a kernel of size 50 x 50 x 50, or even larger). I understand that one way of optimizing this is to convert the problem into a matrix multiplication and use a BLAS GEMM routine, but I'm afraid memory could not hold such a big matrix (see the rough estimate after this list).

  2. Due to the nature of the problem, I would prefer direct convolution to FFT-based convolution, since my model was developed with direct convolution in mind, and my impression is that FFT convolution gives slightly different results than direct convolution, especially for rapidly changing images, which is a discrepancy I'm trying to avoid. That said, I'm by no means an expert, so if you have a great FFT-based implementation, and/or my impression of FFT convolution is completely off, I would really appreciate your help.

  3. The input images are assumed to be periodic, so periodic padding is necessary

  4. I understand that utilizing BLAS/SIMD or other lower-level approaches would definitely help a lot here, but since I'm a newbie I don't really know where to start. If you have experience with these libraries, I would really appreciate it if you could point me in the right direction.
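Regarding the memory concern in point 1: the usual way to lower convolution to GEMM (often called im2col) builds a matrix with one row per output voxel and one column per kernel tap, so for the sizes mentioned there it would need about 200^3 × 50^3 × 8 bytes ≈ 8 × 10^12 bytes (on the order of 8 TB) for the lowered matrix alone, in double precision.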

Thanks a lot for your help, and please let me know if you need more info about the nature of the problem

Solution

As a first step, replace your mod ((i - xShift), imageDimX) with something like this:

// Wraps an index that is at most one period out of range back into [0, size).
// The two branches replace the relatively expensive integer modulo.
inline int clamp( int x, int size )
{
    if( x < 0 ) return x + size;
    if( x >= size ) return x - size;
    return x;
}

These branches are very predictable because they yield the same result for long runs of consecutive elements. Integer modulo, by contrast, is relatively slow.
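At the call sites, the replacement might look like this (a sketch; it relies on the kernel being no larger than the image in each dimension, so each index is at most one period out of range):

outI = clamp( i - xShift, imageDimX );
outJ = clamp( j - yShift, imageDimY );
outK = clamp( k - zShift, imageDimZ );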

Now, the next step (ordered by cost/benefit) is going to be parallelizing. If you have any modern C or C++ compiler, just enable OpenMP somewhere in the project settings. After that you need two changes.

  1. Decorate your outermost loop with something like this: #pragma omp parallel for schedule(guided)
  2. Move your function-level variables inside that loop. This also means you'll have to compute the initial imageIndex from your k for each iteration; see the sketch after this list.
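A minimal sketch of those two changes, reusing the names from the question (the three kernel loops are elided). Note that as long as the scatter store result[outIndex] += ... remains inside, different threads can update the same output element near the periodic boundary, so in practice this step pairs with the single-write rework described next:

#pragma omp parallel for schedule(guided)
for (int k = 0; k < imageDimZ; k++)
{
    // Everything the iteration needs is derived from k, so each thread
    // (and each iteration) is independent of the previous one.
    int imageIndex = k * imageDimX * imageDimY;
    for (int j = 0; j < imageDimY; j++)
    {
        for (int i = 0; i < imageDimX; i++)
        {
            int kernelIndex = 0;
            // ... the three kernel loops from the original code go here ...
            imageIndex++;
        }
    }
}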

Next option: rework your code so that each output value is only written once. Compute the final value in your innermost three loops, reading from random locations in both the image and the kernel, and write the result only once. When you have that result[outIndex] += in the inner loop, the CPU stalls waiting for data from memory; when you accumulate into a variable that lives in a register rather than in memory, there is no such access latency.
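Such a gather-style rework could look roughly like this (a sketch rather than a tested drop-in: it reuses clamp from above, follows the index conventions from the question, and again assumes the kernel is no larger than the image):

for (int k = 0; k < imageDimZ; k++)
{
    for (int j = 0; j < imageDimY; j++)
    {
        for (int i = 0; i < imageDimX; i++)
        {
            double acc = 0.0;   // accumulates in a register, not in result[]
            int kernelIndex = 0;
            for (int n = 0; n < kernelDimZ; n++)
            {
                for (int m = 0; m < kernelDimY; m++)
                {
                    for (int l = 0; l < kernelDimX; l++)
                    {
                        // the input voxel that used to scatter into (i, j, k)
                        int srcI = clamp( i + l - kernelCenterX, imageDimX );
                        int srcJ = clamp( j + m - kernelCenterY, imageDimY );
                        int srcK = clamp( k + n - kernelCenterZ, imageDimZ );
                        acc += kernel[kernelIndex++]
                             * image[( srcK * imageDimY + srcJ ) * imageDimX + srcI];
                    }
                }
            }
            // single write per output voxel
            result[( k * imageDimY + j ) * imageDimX + i] = acc;
        }
    }
}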

SIMD is the most complicated optimization here. In short, you'll need the maximum FMA width your hardware has (if you have AVX and need double precision, that width is 4 doubles per register), and you'll also need multiple independent accumulators in your three innermost loops, to avoid being bound by latency instead of saturating the throughput. Here's my answer to a much easier problem as an example of what I mean.
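To show just the multiple-accumulator pattern in isolation (not the convolution itself), here is a sketch of a double-precision dot product, assuming AVX2 with FMA and, to keep it short, a length that is a multiple of 16:

#include <immintrin.h>

// Four independent accumulators hide the latency of the fused multiply-add,
// so the FMA units stay busy instead of waiting on the previous result.
double dot_fma( const double *a, const double *b, int n )
{
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd();
    __m256d acc3 = _mm256_setzero_pd();
    for( int i = 0; i < n; i += 16 )
    {
        acc0 = _mm256_fmadd_pd( _mm256_loadu_pd( a + i ),      _mm256_loadu_pd( b + i ),      acc0 );
        acc1 = _mm256_fmadd_pd( _mm256_loadu_pd( a + i + 4 ),  _mm256_loadu_pd( b + i + 4 ),  acc1 );
        acc2 = _mm256_fmadd_pd( _mm256_loadu_pd( a + i + 8 ),  _mm256_loadu_pd( b + i + 8 ),  acc2 );
        acc3 = _mm256_fmadd_pd( _mm256_loadu_pd( a + i + 12 ), _mm256_loadu_pd( b + i + 12 ), acc3 );
    }
    // combine the partial sums and reduce horizontally
    __m256d acc = _mm256_add_pd( _mm256_add_pd( acc0, acc1 ), _mm256_add_pd( acc2, acc3 ) );
    double tmp[4];
    _mm256_storeu_pd( tmp, acc );
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}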
