有效构建汇总区域表 [英] Efficiently Building Summed Area Table

查看：133 发布时间：2016/10/30 22:29:53 c++ visual-c++ image-processing sse

本文介绍了有效构建汇总区域表的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我试图构建一个求和区域表，供以后在自适应阈值程序中使用。由于这个代码将用于时间关键的软件，我试图挤压尽可能多的循环。

I am trying to construct a summed area table for later use in an adaptive thresholding routine. Since this code is going to be used in time critical software, I am trying to squeeze as many cycles as possible out of it.

对于性能，表格是每个像素的无符号整数。

For performance, the table is unsigned integers for every pixel.

当我连接我的profiler显示我在执行x-pass时会出现最大的性能瓶颈。

When I attach my profiler, I am showing that my largest performance bottleneck occurs when performing the x-pass.

计算的简单数学表达式是：

The simple math expression for the computation is:

sat_[y * width + x] = sat_[y * width + x - 1] + buff_[y * width + x]
where the running sum resets at every new y position.

在这种情况下， sat _ -D表示SAT的无符号整数， buff _ 是一个8位无符号单色缓冲区。

In this case, sat_ is a 1-D pointer of unsigned integers representing the SAT, and buff_ is an 8-bit unsigned monochrome buffer.

实现如下：

uint *pSat = sat_;
char *pBuff = buff_;

for (size_t y = 0; y < height; ++y, pSat += width, pBuff += width)
{
    uint curr = 0;
    for (uint x = 0; x < width; x += 4)
    {
        pSat[x + 0] = curr += pBuff[x + 0];
        pSat[x + 1] = curr += pBuff[x + 1];
        pSat[x + 2] = curr += pBuff[x + 2];
        pSat[x + 3] = curr += pBuff[x + 3];
    }
}

循环是手动展开的，因为我的编译器没有为我做。我的问题是，整个分割例程花费非常大量的时间刚刚通过该循环，我想知道是否有人有什么可能加速它的任何想法。我可以访问所有的SSE的集合，AVX的任何机器这个例程将运行，所以如果有东西，那将是非常有用的。

The loop is unrolled manually because my compiler (VC11) didn't do it for me. The problem I have is that the entire segmentation routine is spending an extraordinary amount of time just running through that loop, and I am wondering if anyone has any thoughts on what might speed it up. I have access to all of the SSE's sets, and AVX for any machine this routine will run on, so if there is something there, that would be extremely useful.

，一旦我挤出了最后的周期，我然后计划将其扩展到多核，但我想获得单线程计算尽可能紧密，我使模型更复杂。

Also, once I squeeze out the last cycles, I then plan on extending this to multi-core, but I want to get the single thread computation as tight as possible before I make the model more complex.

有效构建汇总区域表 [英] Efficiently Building Summed Area Table

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

有效构建汇总区域表 [英] Efficiently Building Summed Area Table

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭