有效构建汇总区域表 [英] Efficiently Building Summed Area Table

查看:133
本文介绍了有效构建汇总区域表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图构建一个求和区域表,供以后在自适应阈值程序中使用。由于这个代码将用于时间关键的软件,我试图挤压尽可能多的循环。

I am trying to construct a summed area table for later use in an adaptive thresholding routine. Since this code is going to be used in time critical software, I am trying to squeeze as many cycles as possible out of it.

对于性能,表格是每个像素的无符号整数。

For performance, the table is unsigned integers for every pixel.

当我连接我的profiler显示我在执行x-pass时会出现最大的性能瓶颈。

When I attach my profiler, I am showing that my largest performance bottleneck occurs when performing the x-pass.

计算的简单数学表达式是:

The simple math expression for the computation is:

sat_[y * width + x] = sat_[y * width + x - 1] + buff_[y * width + x]
where the running sum resets at every new y position.

在这种情况下, sat _ -D表示SAT的无符号整数, buff _ 是一个8位无符号单色缓冲区。

In this case, sat_ is a 1-D pointer of unsigned integers representing the SAT, and buff_ is an 8-bit unsigned monochrome buffer.

实现如下:

uint *pSat = sat_;
char *pBuff = buff_;

for (size_t y = 0; y < height; ++y, pSat += width, pBuff += width)
{
    uint curr = 0;
    for (uint x = 0; x < width; x += 4)
    {
        pSat[x + 0] = curr += pBuff[x + 0];
        pSat[x + 1] = curr += pBuff[x + 1];
        pSat[x + 2] = curr += pBuff[x + 2];
        pSat[x + 3] = curr += pBuff[x + 3];
    }
}

循环是手动展开的,因为我的编译器没有为我做。我的问题是,整个分割例程花费非常大量的时间刚刚通过该循环,我想知道是否有人有什么可能加速它的任何想法。我可以访问所有的SSE的集合,AVX的任何机器这个例程将运行,所以如果有东西,那将是非常有用的。

The loop is unrolled manually because my compiler (VC11) didn't do it for me. The problem I have is that the entire segmentation routine is spending an extraordinary amount of time just running through that loop, and I am wondering if anyone has any thoughts on what might speed it up. I have access to all of the SSE's sets, and AVX for any machine this routine will run on, so if there is something there, that would be extremely useful.

,一旦我挤出了最后的周期,我然后计划将其扩展到多核,但我想获得单线程计算尽可能紧密,我使模型更复杂。

Also, once I squeeze out the last cycles, I then plan on extending this to multi-core, but I want to get the single thread computation as tight as possible before I make the model more complex.

推荐答案

你有一个依赖链运行在每一行;每个结果取决于前一个。所以你不能在这个方向上vectorised / parallelise。

You have a dependency chain running along each row; each result depends on the previous one. So you cannot vectorise/parallelise in that direction.

但是,它听起来像是每行独立于所有其他,所以你可以通过计算多行同时进行vectorise / paralellise 。你需要转置数组,以允许向量指令访问内存中的相邻元素。 *

But, it sounds like each row is independent of all the others, so you can vectorise/paralellise by computing multiple rows simultaneously. You'd need to transpose your arrays, in order to allow the vector instructions to access neighbouring elements in memory.*

但是,一个问题。沿着行行走现在从缓存的角度来看是绝对可怕的(每次迭代将是缓存未命中)。解决这个问题的方法是交换循环顺序。

However, that creates a problem. Walking along rows would now be absolutely terrible from a cache point of view (every iteration would be a cache miss). The way to solve this is to interchange the loop order.

请注意,每个元素被精确读取一次。你对每个元素只做很少的计算。因此,在您达到100%CPU使用率之前,您基本上会受到主内存带宽的限制。

Note, though, that each element is read precisely once. And you're doing very little computation per element. So you'll basically be limited by main-memory bandwidth well before you hit 100% CPU usage.



*此限制可能在AVX2中解除,我不确定...

这篇关于有效构建汇总区域表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆