OpenCV Sum of squared differences speed

Question

I've been using OpenCV to do some block matching and I've noticed its sum of squared differences code is very fast compared to a straightforward for loop like this:

int SSD = 0;
for(int i =0; i < arraySize; i++)
    SSD += (array1[i] - array2[i] )*(array1[i] - array2[i]);

If I look at the source code to see where the heavy lifting happens, the OpenCV folks have their for loops do 4 squared difference calculations in each iteration of the loop. The function that does the block matching looks like this:

int64
icvCmpBlocksL2_8u_C1( const uchar * vec1, const uchar * vec2, int len )
{
    int i, s = 0;
    int64 sum = 0;

    /* Main loop: four squared differences per iteration. */
    for( i = 0; i <= len - 4; i += 4 )
    {
        int v = vec1[i] - vec2[i];
        int e = v * v;

        v = vec1[i + 1] - vec2[i + 1];
        e += v * v;
        v = vec1[i + 2] - vec2[i + 2];
        e += v * v;
        v = vec1[i + 3] - vec2[i + 3];
        e += v * v;
        sum += e;
    }

    /* Tail loop: the remaining len % 4 elements. */
    for( ; i < len; i++ )
    {
        int v = vec1[i] - vec2[i];

        s += v * v;
    }

    return sum + s;
}

This calculation is for unsigned 8 bit integers. They perform a similar calculation for 32-bit floats in this function:

double
icvCmpBlocksL2_32f_C1( const float *vec1, const float *vec2, int len )
{
    double sum = 0;
    int i;

    /* Main loop: four squared differences per iteration, accumulated in double. */
    for( i = 0; i <= len - 4; i += 4 )
    {
        double v0 = vec1[i] - vec2[i];
        double v1 = vec1[i + 1] - vec2[i + 1];
        double v2 = vec1[i + 2] - vec2[i + 2];
        double v3 = vec1[i + 3] - vec2[i + 3];

        sum += v0 * v0 + v1 * v1 + v2 * v2 + v3 * v3;
    }

    /* Tail loop: the remaining len % 4 elements. */
    for( ; i < len; i++ )
    {
        double v = vec1[i] - vec2[i];

        sum += v * v;
    }
    return sum;
}

I was wondering if anyone had any idea whether breaking a loop up into chunks of 4 like this might speed up code? I should add that there is no multithreading occurring in this code.

Recommended answer

My guess is that this is just a simple implementation of loop unrolling (http://en.wikipedia.org/wiki/Loop_unwinding): each pass of the unrolled loop saves 3 index additions and 3 loop-condition compares, which can be a great savings if, for example, checking len involves a cache miss. The downside is that this optimization adds code complexity (e.g. the additional for loop at the end, which handles the len % 4 items left over when the length is not evenly divisible by 4) and, of course, it's an architecture-dependent optimization whose magnitude of improvement will vary by hardware, compiler, and so on.
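To make the trade-off concrete, here is a minimal sketch of the same sum written both ways. The names ssd_simple and ssd_unrolled4 are made up for illustration and are not OpenCV functions; the code uses standard C types rather than OpenCV's uchar and int64. It only shows that the unrolled form plus its tail loop computes the same value as the plain loop; it makes no claim about which is faster on a given machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain loop: one squared difference, one index increment and
   one loop test per element. (Hypothetical name, not from OpenCV.) */
int64_t ssd_simple(const unsigned char *a, const unsigned char *b, size_t len)
{
    int64_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        int v = a[i] - b[i];
        sum += v * v;
    }
    return sum;
}

/* Unrolled by 4: one index increment and one loop test per four
   elements, followed by a tail loop for the len % 4 leftovers.
   (Hypothetical name, not from OpenCV.) */
int64_t ssd_unrolled4(const unsigned char *a, const unsigned char *b, size_t len)
{
    int64_t sum = 0;
    size_t i = 0;

    for (; i + 4 <= len; i += 4) {
        int v0 = a[i]     - b[i];
        int v1 = a[i + 1] - b[i + 1];
        int v2 = a[i + 2] - b[i + 2];
        int v3 = a[i + 3] - b[i + 3];
        /* Each term is at most 255 * 255, so the four-term sum fits in int. */
        sum += v0 * v0 + v1 * v1 + v2 * v2 + v3 * v3;
    }

    /* Tail: fewer than four elements remain. */
    for (; i < len; i++) {
        int v = a[i] - b[i];
        sum += v * v;
    }
    return sum;
}
```

Note that an optimizing compiler may unroll or vectorize the plain loop on its own, so the measured gap between the two depends heavily on the compiler and its flags.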

Still, it's straightforward to follow compared to most optimizations and will probably result in some sort of performance increase regardless of the architecture, so it's low risk to just throw it in there and hope for the best. Since OpenCV is such a well-supported chunk of code, I'm sure that someone instrumented these chunks of code and found them to be well worth it - as you yourself have done.
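If you want to instrument this yourself rather than take it on faith, a rough timing harness might look like the sketch below. Everything here is an assumption for illustration: time_ssd, ssd_fn and ssd_reference are hypothetical names, and clock() from the standard library measures processor time at coarse resolution, so it is only meaningful with a large repetition count.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Signature shared by the SSD variants being compared (hypothetical). */
typedef int64_t (*ssd_fn)(const unsigned char *, const unsigned char *, size_t);

/* A plain reference implementation to pass to the harness. */
int64_t ssd_reference(const unsigned char *a, const unsigned char *b, size_t len)
{
    int64_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        int v = a[i] - b[i];
        sum += v * v;
    }
    return sum;
}

/* Calls fn `reps` times and returns the elapsed processor time in seconds.
   The last result is stored through last_result (if non-NULL); the volatile
   sink discourages the compiler from discarding the repeated calls. */
double time_ssd(ssd_fn fn, const unsigned char *a, const unsigned char *b,
                size_t len, int reps, int64_t *last_result)
{
    volatile int64_t sink = 0;
    clock_t start = clock();

    for (int r = 0; r < reps; r++)
        sink = fn(a, b, len);

    clock_t end = clock();
    if (last_result)
        *last_result = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Passing each variant (the plain loop and an unrolled one) through time_ssd with the same buffers and the same reps gives a like-for-like comparison; it is worth repeating the measurement at several optimization levels, since the result can change with compiler flags.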
