Why does gcc auto-vectorization not work on convolution matrices bigger than 3x3?


Problem description


I've implemented the following program for a convolution matrix:

#include <stdio.h>
#include <time.h>

#define NUM_LOOP 1000
#define N 128   //input or output dimension 1
#define M N     //input or output dimension 2
#define P 5     //convolution matrix dimension 1; for a 3x3 convolution matrix it must be 3
#define Q P     //convolution matrix dimension 2
#define Csize P*Q
#define Cdiv  1     //div for filter
#define Coffset 0   //offset

//functions
void unusual(); //unusual implementation of convolution
void naive();
//data
unsigned short int input[N][M] __attribute__(( aligned(32)));  // input data
unsigned short int output[N][M] __attribute__(( aligned(32))); // output data
unsigned short int kernel[P][Q] __attribute__(( aligned(32))); // convolution coefficients

int main(){
    struct timespec tStart, tEnd; // used to record the processing time
    double tTotal, tBest = 10000; // the minimum total time will be assigned to the best time

    int w=0;
    do{// this loop repeat the body to record the best time
        clock_gettime(CLOCK_MONOTONIC,&tStart);

        //function to be executed here :

        unusual();

        clock_gettime(CLOCK_MONOTONIC,&tEnd);
        tTotal = (tEnd.tv_sec - tStart.tv_sec);
        tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;

        if(tTotal<tBest)
            tBest=tTotal;
    } while(w++ < NUM_LOOP);

    printf(" The best time: %lf sec in %d repetitions for %dX%d matrix\n", tBest, w, N, M);

    return 0;
}

//unusual sequential convolution
void unusual(){
    int i, j,k,temp;

    for (i=P/2; i< N-P/2; i++){
        for(j=Q/2; j< M-Q/2; j++){
            temp=0;
            for(k=0; k< Csize; k++){
                temp += (kernel[k/P][k%Q]) * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);

            }
            output[i][j]=((temp/(Cdiv))+Coffset);
        }
    }
}
//The naive implementation
inline void naive(){
    int i, j,k,l,temp;
    for (i=P/2; i< N-P/2; i++){
        for(j=Q/2; j< M-Q/2; j++){
            temp=0;

            for(k = 0; k <  P; k++){ 
                for(l = 0; l <  Q; l++){
                    temp += (kernel[k][l]) * (input[i - (P/2)+k][j - (Q/2)+l]);
                }
            }
            output[i][j]=((temp/(Cdiv))+Coffset);
        }
    }
}

The problem is that when I use -O3 for auto-vectorization, it only works for a 3x3 convolution matrix. I've looked at the assembly output: auto-vectorization makes changes only for the 3x3 kernel and improves performance reasonably (20 times faster; note: the scalar version of the unusual func is slower than the naive func), but there is no improvement for a 5x5 convolution matrix.

UPDATE: I added the naive implementation to the question and, for clarification, changed the picture size to NxM, the conv matrix to kernel, Cdim1xCdim2 to PxQ, and the seqConv function to unusual. The question is not how to improve the implementation of the unusual function. The question is: given that all elements are in the same places in memory, and gcc uses heuristics, etc., why does gcc fail to improve this unusual implementation? NOTE: the problem is not the naive implementation. gcc -O3 improves the naive implementation for 3x3 and 5x5 kernels by a ~7x speedup, and for 7x7 and 9x9 by ~1.5x. To improve the convolution I used intrinsics, and the speedup is more than 40x over the naive implementation, which is itself ~2x faster than the unusual convolution. So my vectorized version is ~80x faster than my unusual one. Hand-tuned optimization is not the problem. The auto-vectorizer optimization is the problem, and the reason for the failure.

GCC command : gcc -Wall -march=native -O3 -o "%e" "%f"

Platform: Linux mint, Skylake, gcc 6.2

Thanks in advance

Solution

It seems no one is interested in answering this question, so I will share my findings and update my answer in the future.

First update: In my experience, gcc -fopt-info-vec reports vectorizing only for Csize <= 16. This is because the vectorization factor is 16, and it is one of the reasons gcc does not vectorize the unusual implementation for other kernel sizes. The vectorization factor refers to the number of elements that can be put in a vector; in this case the short integer is a 16-bit element.

From wikipedia:

In the first step, the compiler looks for obstacles that can prevent vectorization. A major obstacle for vectorization is true data dependency shorter than the vector length. Other obstacles include function calls and short iteration counts.
