Fast 2D Convolution in C


Question

I'm trying to implement a convolutional neural network in Python. Originally, I was using scipy.signal's convolve2d function to do the convolution, but it has a lot of overhead, and it would be faster to just implement my own algorithm in C and call it from Python, since I know what my input looks like.

I've implemented 2 functions:

  1. Convolving a matrix with a non-separable kernel
  2. Convolving a matrix with a separable kernel (for now I've assumed Python does the rank checking and splitting before passing it on to C)

Neither of these functions has padding since I require dimensionality reduction.
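(In other words it's a "valid" convolution: RESULT_DIM = IMG_DIM - FILTER_DIM + 1, so a 4000x4000 image convolved with a 5x5 kernel yields a 3996x3996 result.)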

Non-separable 2D Convolution

// a - 2D matrix (as a 1D array), w - kernel
double* conv2(double* a, double* w, double* result)
{
    register double acc;
    register int i; 
    register int j;
    register int k1, k2;
    register int l1, l2;
    register int t1, t2;

    for(i = 0; i < RESULT_DIM; i++) 
    {
        t1 = i * RESULT_DIM; // loop invariants
        for(j = 0; j < RESULT_DIM; j++) 
        {   
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                t2 = k1 * FILTER_DIM;  // loop invariants
                for(l1 = FILTER_DIM - 1, l2 = 0; l1 >= 0; l1--, l2++)
                {
                    acc += w[t2 + l1] * a[(i + k2) * IMG_DIM + (j + l2)];
                }
            }
            result[t1 + j] = acc;
        }
    }

    return result;
}

Separable 2D Convolution

// a - 2D matrix, w1, w2 - the separated 1D kernels
double* conv2sep(double* a, double* w1, double* w2, double* result)
{
    register double acc;
    register int i; 
    register int j;
    register int k1, k2;
    register int t;
    double* tmp = (double*)malloc(IMG_DIM * RESULT_DIM * sizeof(double));

    for(i = 0; i < RESULT_DIM; i++) // convolve with w1 
    {
        t = i * IMG_DIM; // a and tmp rows are IMG_DIM wide
        for(j = 0; j < IMG_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w1[k1] * a[k2 * IMG_DIM + t + j];
            }
            tmp[t + j] = acc;
        }
    }

    for(i = 0; i < RESULT_DIM; i++) // convolve with w2
    {
        t = i * IMG_DIM; // row base into tmp (rows are IMG_DIM wide)
        for(j = 0; j < RESULT_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w2[k1] * tmp[t + (j + k2)];
            }

            result[i * RESULT_DIM + j] = acc; // result rows are RESULT_DIM wide
        }
    }

    free(tmp);
    return result;
}

Compiling with gcc's -O3 flag and testing on a 2.7GHz Intel i7 with a 4000x4000 matrix and a 5x5 kernel, I get the following timings, respectively (average of 5 runs):

271.21900 ms
127.32000 ms

This is still a considerable improvement over scipy.signal's convolve2d, which takes around 2 seconds for the same operation, but I need more speed since I'll be calling this function thousands of times. Changing the data type to float isn't an option at the moment, even though it'd give a significant speedup.

Is there a way I can optimise these algorithms further? Can I apply any cache tricks or routines to speed it up?

Any suggestions would be appreciated.

Solution

If you're running on x86 only then consider using SSE or AVX SIMD optimisation. For double data the throughput improvement will be modest, but if you can switch to float then you may be able to get to around 4x improvement with SSE or 8x with AVX. There are a number of questions and answers about this very topic on StackOverflow already from which you may be able to get some ideas on the implementation. Alternatively there are also many libraries available which include high performance 2D convolution (filtering) routines, and these typically exploit SIMD for performance, e.g. Intel's IPP (commercial), or OpenCV (free).
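
As a rough illustration, the horizontal pass of the separable version might be vectorised for double data along the lines below, assuming the same IMG_DIM/RESULT_DIM/FILTER_DIM macros and a compiler invocation such as gcc -O3 -mavx -mfma; hconv_row_avx and its per-row interface are just illustrative, not a drop-in replacement:

#include <immintrin.h>

// Sketch: AVX/FMA horizontal (w2) pass, four double outputs per iteration.
// tmp_row points at one row of the intermediate buffer (IMG_DIM wide),
// out_row at one row of the result (RESULT_DIM wide). The tap order matches
// the scalar code (the kernel is applied flipped).
void hconv_row_avx(const double* tmp_row, const double* w2, double* out_row)
{
    int j = 0;
    for(; j + 4 <= RESULT_DIM; j += 4)
    {
        __m256d acc = _mm256_setzero_pd();
        for(int k = 0; k < FILTER_DIM; k++)
        {
            __m256d coeff = _mm256_set1_pd(w2[FILTER_DIM - 1 - k]); // broadcast one tap
            __m256d data  = _mm256_loadu_pd(&tmp_row[j + k]);       // 4 adjacent inputs
            acc = _mm256_fmadd_pd(coeff, data, acc);                // acc += coeff * data
        }
        _mm256_storeu_pd(&out_row[j], acc);
    }
    for(; j < RESULT_DIM; j++) // scalar tail for the remaining columns
    {
        double acc = 0.0;
        for(int k = 0; k < FILTER_DIM; k++)
            acc += w2[FILTER_DIM - 1 - k] * tmp_row[j + k];
        out_row[j] = acc;
    }
}

The same structure with float data and the corresponding _ps intrinsics handles 8 values per vector with AVX (or 4 with SSE), which is where the larger speedups mentioned above come from.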

Another possibility is to exploit multiple cores - split your image into blocks and run each block in its own thread. E.g. if you have a 4 core CPU then split your image into 4 blocks. (See pthreads).
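
For instance, a minimal pthreads sketch (error handling omitted; conv_job, conv2_rows and conv2_parallel are illustrative names) that splits the non-separable convolution into NUM_THREADS bands of output rows, reusing the same macros and indexing as conv2 above - build with -pthread:

#include <pthread.h>

#define NUM_THREADS 4

typedef struct {
    const double* a;        // input image
    const double* w;        // kernel
    double* result;         // output
    int row_begin, row_end; // half-open range of output rows for this thread
} conv_job;

static void* conv2_rows(void* arg)
{
    conv_job* job = arg;
    for(int i = job->row_begin; i < job->row_end; i++)
        for(int j = 0; j < RESULT_DIM; j++)
        {
            double acc = 0.0;
            for(int k = 0; k < FILTER_DIM; k++)      // same flipped-kernel indexing as conv2
                for(int l = 0; l < FILTER_DIM; l++)
                    acc += job->w[(FILTER_DIM - 1 - k) * FILTER_DIM + (FILTER_DIM - 1 - l)]
                         * job->a[(i + k) * IMG_DIM + (j + l)];
            job->result[i * RESULT_DIM + j] = acc;
        }
    return NULL;
}

void conv2_parallel(const double* a, const double* w, double* result)
{
    pthread_t tid[NUM_THREADS];
    conv_job job[NUM_THREADS];
    int chunk = (RESULT_DIM + NUM_THREADS - 1) / NUM_THREADS;

    for(int n = 0; n < NUM_THREADS; n++)
    {
        int begin = n * chunk;
        int end = begin + chunk < RESULT_DIM ? begin + chunk : RESULT_DIM;
        job[n] = (conv_job){ a, w, result, begin, end };
        pthread_create(&tid[n], NULL, conv2_rows, &job[n]);
    }
    for(int n = 0; n < NUM_THREADS; n++)
        pthread_join(tid[n], NULL);
}

Each thread writes a disjoint band of result rows, so no locking is needed; the two passes of the separable version can be split across threads in the same way.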

You can of course combine both of the above ideas, if you really want to fully optimise this operation.


Some small optimisations which you can apply to your current code, and to any future implementations (e.g. SIMD):

  • if your kernels are symmetric (or odd-symmetric) then you can reduce the number of operations by adding (subtracting) symmetric input values and performing one multiply rather than two (see the sketch after this list)

  • for the separable case, rather than allocating a full frame temporary buffer, consider using a "strip-mining" approach - allocate a smaller buffer, which is full width, but a relatively small number of rows, then process your image in "strips", alternately applying the horizontal kernel and the vertical kernel. The advantage of this is that you have a much more cache-friendly access pattern and a smaller memory footprint.
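
To illustrate the first bullet: with an odd-length symmetric kernel (w[k] == w[FILTER_DIM - 1 - k]) one output sample of a 1D pass might be computed as below, using the same macros as above (hconv_sym_at is just an illustrative helper):

// Mirrored inputs are added first, so each pair costs one add and one multiply
// instead of two multiplies - e.g. a 5-tap kernel needs 3 multiplies instead of 5.
double hconv_sym_at(const double* row, const double* w, int j)
{
    const int half = FILTER_DIM / 2;
    double acc = w[half] * row[j + half];   // centre tap
    for(int k = 0; k < half; k++)
        acc += w[k] * (row[j + k] + row[j + FILTER_DIM - 1 - k]); // mirrored pair
    return acc;
}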


A few comments on coding style:

  • the register keyword has been redundant for many years, and modern compilers will emit a warning if you try to use it - save yourself some noise (and some typing) by ditching it

  • casting the result of malloc in C is frowned upon - it's redundant and potentially dangerous.

  • make any input parameters const (i.e. read-only) and use restrict for any parameters which can never be aliased (e.g. a and result) - this can not only help to avoid programming errors (at least in the case of const), but in some cases it can help the compiler to generate better optimised code (particularly in the case of potentially aliased pointers).
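
Putting the last two points together, the allocation and the prototypes might look something like this:

// malloc without the cast, sized from the pointed-to type
double* tmp = malloc(IMG_DIM * RESULT_DIM * sizeof(*tmp));

// read-only inputs, and a promise to the compiler that the buffers never alias
double* conv2(const double* restrict a, const double* restrict w,
              double* restrict result);
double* conv2sep(const double* restrict a, const double* restrict w1,
                 const double* restrict w2, double* restrict result);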
