Optimizing 1D Convolution
Question
Is there a way to speed up this 1D convolution? I tried to make dy cache-efficient, but compiling with g++ and -O3 gave worse performance.
I am convolving with [-1., 0., 1.] in both directions. This is not homework.
#include<iostream>
#include<cstdlib>
#include<sys/time.h>
// Print the matrix row by row, comma-separated.
void print_matrix(int height, int width, float *matrix){
    for (int j=0; j < height; j++){
        for (int i=0; i < width; i++){
            std::cout << matrix[j * width + i] << ",";
        }
        std::cout << std::endl;
    }
}

// Fill the matrix with pseudo-random values in [0, 1].
void fill_matrix(int height, int width, float *matrix){
    for (int j=0; j < height; j++){
        for (int i=0; i < width; i++){
            matrix[j * width + i] = ((float)rand() / (float)RAND_MAX);
        }
    }
}
#define RESTRICT __restrict__

// Horizontal derivative: out[j][i] = in[j][i+1] - in[j][i-1] for interior columns.
void dx_matrix(int height, int width, float * RESTRICT in_matrix, float * RESTRICT out_matrix, float *min, float *max){
    //init min,max
    *min = *max = -1.F * in_matrix[0] + in_matrix[1];
    for (int j=0; j < height; j++){
        float* row = in_matrix + j * width;
        for (int i=1; i < width-1; i++){
            float res = -1.F * row[i-1] + row[i+1]; /* -1.F * value + 0.F * value + 1.F * value; */
            if (res > *max ) *max = res;
            if (res < *min ) *min = res;
            out_matrix[j * width + i] = res;
        }
    }
}
// Vertical derivative: out[j][i] = in[j+1][i] - in[j-1][i] for interior rows.
void dy_matrix(int height, int width, float * RESTRICT in_matrix, float * RESTRICT out_matrix, float *min, float *max){
    //init min,max with the first value the loop actually produces (j=1, i=0)
    *min = *max = -1.F * in_matrix[0] + in_matrix[2 * width];
    for (int j=1; j < height-1; j++){
        for (int i=0; i < width; i++){
            float res = -1.F * in_matrix[ (j-1) * width + i] + in_matrix[ (j+1) * width + i] ;
            if (res > *max ) *max = res;
            if (res < *min ) *min = res;
            out_matrix[j * width + i] = res;
        }
    }
}
// Wall-clock time in seconds.
double now (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}
int main(int argc, char **argv){
    int width, height;
    float *in_matrix;
    float *out_matrix;
    if(argc < 3){
        std::cout << "usage: " << argv[0] << " width height" << std::endl;
        return -1;
    }
    srand(123);
    width = atoi(argv[1]);
    height = atoi(argv[2]);
    std::cout << "Width:"<< width << " Height:" << height << std::endl;
    if (width < 3){
        std::cout << "Width too short " << std::endl;
        return -1;
    }
    if (height < 3){
        std::cout << "Height too short " << std::endl;
        return -1;
    }
    in_matrix = (float *) malloc( height * width * sizeof(float));
    out_matrix = (float *) malloc( height * width * sizeof(float));
    fill_matrix(height, width, in_matrix);
    //print_matrix(height, width, in_matrix);
    float min, max;
    double a = now();
    dx_matrix(height, width, in_matrix, out_matrix, &min, &max);
    std::cout << "dx min:" << min << " max:" << max << std::endl;
    dy_matrix(height, width, in_matrix, out_matrix, &min, &max);
    double b = now();
    std::cout << "dy min:" << min << " max:" << max << std::endl;
    std::cout << "time: " << b-a << " sec" << std::endl;
    free(in_matrix);
    free(out_matrix);
    return 0;
}
Recommended answer
First of all, I would rewrite the dy loop to get rid of "in_matrix[ (j-1) * width + i]" and "in_matrix[ (j+1) * width + i]", and do something like:
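for (int j=1; j < height-1; j++){
    // Compute the three row pointers once per row, not once per element.
    float *p, *q, *out;
    p = &in_matrix[(j-1)*width];   // row above
    q = &in_matrix[(j+1)*width];   // row below
    out = &out_matrix[j*width];    // output row
    for (int i=0; i < width; i++){
        float res = -1.F * p[i] + q[i] ;
        if (res > *max ) *max = res;
        if (res < *min ) *min = res;
        out[i] = res;
    }
}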
But that is a trivial optimization that the compiler may already be doing for you.
It will be slightly faster to do "q[i]-p[i]" instead of "-1.f*p[i]+q[i]", but, again, the compiler may be smart enough to do that behind your back.
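To make the two suggestions above concrete, here is a minimal sketch of the whole dy pass with the row pointers hoisted and the subtraction written directly. The name dy_matrix_rows is hypothetical. Accumulating min/max in local variables is an extra assumption of mine, not something stated above; it may make the loop easier for the compiler to vectorize, because stores through min and max can no longer alias the output, but measure before relying on it. The sketch also seeds min/max with the first value the loop really produces.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical name: same contract as the question's dy_matrix.
// __restrict__ is the GCC/Clang extension already used in the question.
void dy_matrix_rows(int height, int width,
                    const float* __restrict__ in_matrix,
                    float* __restrict__ out_matrix,
                    float* min, float* max) {
    assert(height >= 3 && width >= 1);

    // Seed with the first output value (row 1, column 0).
    float lo = in_matrix[2 * width] - in_matrix[0];
    float hi = lo;

    for (int j = 1; j < height - 1; j++) {
        // Row pointers are computed once per row.
        const float* p = in_matrix + static_cast<std::size_t>(j - 1) * width;
        const float* q = in_matrix + static_cast<std::size_t>(j + 1) * width;
        float* out = out_matrix + static_cast<std::size_t>(j) * width;

        for (int i = 0; i < width; i++) {
            float res = q[i] - p[i];   // same value as -1.F * p[i] + q[i]
            if (res > hi) hi = res;
            if (res < lo) lo = res;
            out[i] = res;
        }
    }

    // Write the results back once, after the loops.
    *min = lo;
    *max = hi;
}
```

It can be dropped into the question's main in place of dy_matrix, since the argument list is unchanged.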
The whole thing would benefit considerably from SSE2 and multithreading. I'd bet on at least a 3x speedup from SSE2 right away. Multithreading can be added using OpenMP and it will only take a few lines of code.
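As a rough illustration of both ideas, the sketch below processes each row four floats at a time with SSE intrinsics and spreads the rows across threads with one OpenMP pragma. The function names dy_row_sse and dy_matrix_sse_omp are hypothetical, and the code assumes an x86 target plus a compiler that supports OpenMP 3.1 or later (needed for min/max reductions). Unaligned loads and stores are used so that no alignment of the malloc'ed buffers is assumed. No speedup figure is implied; it has to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 header; also provides the SSE float intrinsics used here

// Hypothetical helper: out[i] = q[i] - p[i] for one row, updating lo/hi.
static void dy_row_sse(const float* p, const float* q, float* out,
                       int width, float& lo, float& hi) {
    __m128 vlo = _mm_set1_ps(lo);
    __m128 vhi = _mm_set1_ps(hi);
    int i = 0;
    // Main loop: four floats per iteration.
    for (; i + 4 <= width; i += 4) {
        __m128 res = _mm_sub_ps(_mm_loadu_ps(q + i), _mm_loadu_ps(p + i));
        vlo = _mm_min_ps(vlo, res);
        vhi = _mm_max_ps(vhi, res);
        _mm_storeu_ps(out + i, res);
    }
    // Fold the four lanes back into the scalar min/max.
    float lanes_lo[4], lanes_hi[4];
    _mm_storeu_ps(lanes_lo, vlo);
    _mm_storeu_ps(lanes_hi, vhi);
    for (int k = 0; k < 4; k++) {
        if (lanes_lo[k] < lo) lo = lanes_lo[k];
        if (lanes_hi[k] > hi) hi = lanes_hi[k];
    }
    // Scalar tail for widths that are not a multiple of four.
    for (; i < width; i++) {
        float res = q[i] - p[i];
        if (res > hi) hi = res;
        if (res < lo) lo = res;
        out[i] = res;
    }
}

// Hypothetical name: same contract as the question's dy_matrix.
void dy_matrix_sse_omp(int height, int width, const float* in_matrix,
                       float* out_matrix, float* min, float* max) {
    assert(height >= 3 && width >= 1);
    // Seed with the first output value (row 1, column 0).
    float lo = in_matrix[2 * width] - in_matrix[0];
    float hi = lo;

    // Each thread gets private lo/hi; OpenMP combines them at the end.
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for reduction(min : lo) reduction(max : hi)
    for (int j = 1; j < height - 1; j++) {
        const float* p = in_matrix + static_cast<std::size_t>(j - 1) * width;
        const float* q = in_matrix + static_cast<std::size_t>(j + 1) * width;
        float* out = out_matrix + static_cast<std::size_t>(j) * width;
        dy_row_sse(p, q, out, width, lo, hi);
    }

    *min = lo;
    *max = hi;
}
```

With g++ this would be built with something like "g++ -O3 -fopenmp"; on 32-bit x86, -msse2 may also be needed. Rows are independent in the dy pass, which is why parallelizing the outer loop is safe here.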