优化一维卷积 [英] Optimizing 1D Convolution

查看:171
本文介绍了优化一维卷积的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有没有办法加快一维卷积?我试图使dy高速缓存高效
,但使用g ++和-O3进行编译会产生较差的性能。

Is there a way to speed up this 1D convolution ? I tried to make the dy cache efficient but compiling with g++ and -O3 gave worse performances.

我正在研究[-1。 ,0.,1]双向。
不是作业。

I am convolving with [-1. , 0., 1] in both directions. Is not homework.

#include<iostream>
#include<cstdlib>
#include<sys/time.h>

void print_matrix( int height, int width, float *matrix){
    for (int j=0; j < height; j++){
      for (int i=0; i < width; i++){
        std::cout << matrix[j * width + i] << ",";
    }
      std::cout << std::endl;
  }
}

void fill_matrix( int height, int width,  float *matrix){
    for (int j=0; j < height; j++){
      for (int i=0; i < width; i++){
        matrix[j * width + i] = ((float)rand() / (float)RAND_MAX) ;
    }
  }
}

#define RESTRICT __restrict__

void dx_matrix( int height, int width, float * RESTRICT in_matrix,  float * RESTRICT out_matrix, float *min, float *max){
  //init min,max
  *min = *max = -1.F * in_matrix[0] + in_matrix[1]; 

    for (int j=0; j < height; j++){
      float* row = in_matrix + j * width;
      for (int i=1; i < width-1; i++){
        float res = -1.F * row[i-1] + row[i+1]; /* -1.F * value + 0.F * value + 1.F * value; */ 
        if (res > *max ) *max = res;
        if (res < *min ) *min = res;
        out_matrix[j * width + i] = res;
      }
    }
}

void dy_matrix( int height, int width, float * RESTRICT in_matrix,  float * RESTRICT out_matrix, float *min, float *max){
  //init min,max
  *min = *max = -1.F * in_matrix[0] + in_matrix[ width + 1]; 

  for (int j=1; j < height-1; j++){
      for (int i=0; i < width; i++){
        float res = -1.F * in_matrix[ (j-1) * width + i] + in_matrix[ (j+1) * width + i] ;
        if (res > *max ) *max = res;
        if (res < *min ) *min = res;
        out_matrix[j * width + i] =  res;
      }
    }
}

double now (void)                                                                                          
{                                                                                                                    
  struct timeval tv;                                                                                               
  gettimeofday(&tv, NULL);                                                                                         
  return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}


int main(int argc, char **argv){

  int width, height;
  float *in_matrix;
  float *out_matrix;

  if(argc < 3){
    std::cout  << argv[0] << "usage: width height " << std::endl;
    return -1;
  }

  srand(123);

  width = atoi(argv[1]);
  height = atoi(argv[2]);

  std::cout << "Width:"<< width << " Height:" << height << std::endl;

  if (width < 3){
    std::cout << "Width too short " << std::endl;
    return -1;
  }
  if (height < 3){
    std::cout << "Height too short " << std::endl;
    return -1;
  }

  in_matrix = (float *) malloc( height * width * sizeof(float));
  out_matrix = (float *) malloc( height * width * sizeof(float));

  fill_matrix(height, width, in_matrix);
  //print_matrix(height, width, in_matrix);

  float min, max;

  double a = now();
  dx_matrix(height, width, in_matrix, out_matrix, &min, &max);
  std::cout << "dx min:" << min << " max:" << max << std::endl;

  dy_matrix(height, width, in_matrix, out_matrix, &min, &max);
  double b = now();
  std::cout << "dy min:" << min << " max:" << max << std::endl;
  std::cout << "time: " << b-a << " sec" << std::endl;


  return 0;
}


推荐答案

首先,我会重写dy循环以摆脱 [[(j-1)* width + i]和 in_matrix [(j + 1)* width + i],然后执行以下操作:

First of all, I would rewrite the dy loop to get rid of "[ (j-1) * width + i]" and "in_matrix[ (j+1) * width + i]", and do something like:

  float* p, *q, *out;
 p = &in_matrix[(j-1)*width];
 q = &in_matrix[(j+1)*width];
 out = &out_matrix[j*width];
  for (int i=0; i < width; i++){ 
        float res = -1.F * p[i] + q[i] ; 
        if (res > *max ) *max = res; 
        if (res < *min ) *min = res; 
        out[i] =  res; 
      } 

但这是编译器可能已经在为您做的微不足道的优化。

But that is a trivial optimization that the compiler may already be doing for you.

执行 q [i] -p [i]而不是 -1.f * p [i] + q [i]会更快一些,不过,再次,编译器可能很聪明,可以在您身后做它。

It will be slightly faster to do "q[i]-p[i]" instead of "-1.f*p[i]+q[i]", but, again, the compiler may be smart enough to do that behind your back.

整个过程将从SSE2和多线程中受益匪浅。我敢打赌,至少要比SSE2快3倍。可以使用OpenMP添加多线程,它只需几行代码。

The whole thing would benefit considerably from SSE2 and multithreading. I'd bet on at least a 3x speedup from SSE2 right away. Multithreading can be added using OpenMP and it will only take a few lines of code.

这篇关于优化一维卷积的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆