Is computing an integral image on the GPU really faster than on the CPU?

Question

I'm new to GPU computing, so this may be a really naive question.
I did a few look-ups, and it seems that computing an integral image on the GPU is a pretty good idea.
However, when I really dug into it, I started wondering whether it is actually faster than the CPU, especially for big images. So I'd like to know your thoughts on this, and an explanation if the GPU really is faster.

So, assuming we have an MxN image, computing the integral image on the CPU needs roughly 3xMxN additions, which is O(MxN).
On the GPU, following the code provided in the "OpenGL SuperBible", 6th edition, it needs about KxMxNxlog2(N) + KxMxNxlog2(M) operations, where K is the number of operations per step (a lot of bit-shifting, multiplication, addition...).
The GPU can work in parallel on, say, 32 pixels at a time depending on the device, but it's still O(MxNxlog2(M)).
I think even at the common resolution of 640x480, the CPU is still faster.
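
To make the CPU-side count concrete, here is a minimal sketch of the single-pass recurrence, written in Python for readability. The function name `integral_image` and the list-of-lists image layout are assumptions made for illustration, not code from the book. Each pixel costs three additions/subtractions, which matches the rough 3xMxN estimate above.

```python
def integral_image(img):
    """Return the integral image of a 2D list of numbers (hypothetical helper)."""
    rows, cols = len(img), len(img[0])
    # One extra row and column of zeros avoids special cases at the borders.
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for y in range(rows):
        for x in range(cols):
            # pixel + sum above + sum to the left - doubly counted corner
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    # Drop the zero padding before returning.
    return [row[1:] for row in ii[1:]]
```

Every output value depends on values computed just before it, which is why this form is cheap on a CPU but does not map directly onto many independent GPU threads.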

Am I wrong here?
This is the shader code straight from the book. The idea is to use 2 passes: calculate the integral of the rows, then calculate the integral of the columns of the result from pass 1. This shader code is for one pass.

#version 430 core
layout (local_size_x = 1024) in;
shared float shared_data[gl_WorkGroupSize.x * 2];
layout (binding = 0, r32f) readonly uniform image2D input_image;
layout (binding = 1, r32f) writeonly uniform image2D output_image;
void main(void)
{
    uint id = gl_LocalInvocationID.x;
    uint rd_id;
    uint wr_id;
    uint mask;
    ivec2 P = ivec2(id * 2, gl_WorkGroupID.x);
    const uint steps = uint(log2(gl_WorkGroupSize.x)) + 1;
    uint step = 0;
    shared_data[id * 2] = imageLoad(input_image, P).r;
    shared_data[id * 2 + 1] = imageLoad(input_image,
                                        P + ivec2(1, 0)).r;
    barrier();
    memoryBarrierShared();
    for (step = 0; step < steps; step++)
    {
        mask = (1 << step) - 1;
        rd_id = ((id >> step) << (step + 1)) + mask;
        wr_id = rd_id + 1 + (id & mask);
        shared_data[wr_id] += shared_data[rd_id];
        barrier();
        memoryBarrierShared();
    }
    imageStore(output_image, P.yx, vec4(shared_data[id * 2]));
    imageStore(output_image, P.yx + ivec2(0, 1),
               vec4(shared_data[id * 2 + 1]));
}
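
The index arithmetic in the loop above can be checked on the CPU. The following Python sketch simulates one work group sequentially, reusing the shader's `mask`, `rd_id` and `wr_id` formulas; the function name `scan_row` and the addition counter are assumptions added for illustration. It is only meant to show that the loop produces an inclusive prefix sum of the row, and how many additions it performs in total.

```python
from itertools import accumulate

def scan_row(values):
    """Simulate the shader's scan loop for one row; returns (result, additions)."""
    n = len(values)
    threads = n // 2  # plays the role of gl_WorkGroupSize.x
    assert n == 2 * threads and threads >= 1 and threads & (threads - 1) == 0
    data = list(values)  # plays the role of shared_data
    steps = threads.bit_length()  # equals log2(threads) + 1 for powers of two
    additions = 0
    for step in range(steps):
        mask = (1 << step) - 1
        # Within one step reads hit lower half-blocks and writes hit upper
        # half-blocks, so running the "threads" one after another is safe.
        for tid in range(threads):
            rd_id = ((tid >> step) << (step + 1)) + mask
            wr_id = rd_id + 1 + (tid & mask)
            data[wr_id] += data[rd_id]
            additions += 1
    return data, additions

row = list(range(1, 9))
result, additions = scan_row(row)
```

For a row of n pixels this performs (n/2)*log2(n) additions in total, compared with n-1 for a sequential running sum, but only log2(n) steps have to happen one after another. Whether that trade pays off depends on how many threads the device actually runs at once, which is the point of the question.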

Recommended answer

What do you mean by integral image?

My assumption is that it means summing K images of the same resolution MxN together. In that case it is O(K.M.N) on both CPU and GPU, but the constant factor can be better on the GPU, as gfx memory access is much faster than on the CPU side. There are also usually more GPU cores than CPU cores, which favors the GPU for this.

If K is too big to fit into the GPU texture units U at once, then you need to use multiple passes, so O(K.M.N.log(K)/log(U)) for K>U ..., where the CPU might be faster in some cases. But as a previous comment suggested, without a test you can only guess. You also need to take into account that there are things like bindless texturing and texture arrays, which allow doing this in a single pass (but I am unsure whether there are any performance costs for that).

[Edit1] after clearing up what you actually want to do

First, let's assume for simplicity that we have a square input image of NxN pixels. We can then divide the task into H-lines and V-lines separately (similar to a 2D FFT) to ease this process. On top of that, we can subdivide each line into groups of M pixels. So:

N = M.K

where N is the resolution, M is the region resolution and K is the number of regions.

  1. 1st pass

Render a line for each group, so we get K lines of size M. Use a fragment shader that computes the integral image of each region only, outputting to some texture. This is T(0.5*K*M^2*N). The whole thing can be done in fragments rendered by a single QUAD covering the screen ...

  2. 2nd pass

Convert the region integrals to full image integrals. So again render K lines, and in the fragment shader add all the last pixels of each previous group. This is T(0.5*K^3*N). This whole thing can also be done in fragments rendered by a single QUAD covering the screen ...

  3. Do #1 and #2 on the result in the other axis direction
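
The two passes can be sketched for a single line on the CPU. The Python below is only an illustration under stated assumptions: the names `pass1`, `pass2` and `region_scan` are hypothetical, and each output pixel is computed independently from the inputs, the way a fragment shader invocation would, rather than by a running sum.

```python
from itertools import accumulate

def pass1(line, m):
    """Each pixel sums the pixels of its own region up to and including itself."""
    return [sum(line[(i // m) * m : i + 1]) for i in range(len(line))]

def pass2(local, m):
    """Each pixel adds the last pixel of every previous region."""
    return [
        value + sum(local[g * m + m - 1] for g in range(i // m))
        for i, value in enumerate(local)
    ]

def region_scan(line, m):
    """Inclusive prefix sum of one line using regions of m pixels."""
    assert len(line) % m == 0, "line length must be a multiple of the region size"
    return pass2(pass1(line, m), m)

line = list(range(1, 17))
```

With `m = 4` the 16-pixel line is split into K = 4 regions, and `region_scan(line, 4)` gives the same values as a plain running sum. Applying the same two passes along the other axis of the result gives the full 2D integral image.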

This whole thing converts to:

T(2*N*(0.5*K*M^2+0.5*K^3))
T(N*(K*M^2+K^3))
O(N*(K*M^2+K^3))

Now you can tweak M to maximize performance on your setup ... If I rewrite the whole thing in terms of M and N, then:

T(N*((N/M)*M^2+(N/M)^3))
T(N*(N*M+(N/M)^3))

So you should minimize this term, so I would try to use values around

N*M = (N/M)^3
N*M = N^3/M^3
M^4 = N^2
M^2 = N
M = sqrt(N) = N^0.5
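
As a quick numeric check of this choice, the cost term can be evaluated for a few region sizes. This is a sketch only: the function name `cost` and the example N = 1024 are assumptions, and the formula is the T(N*(N*M+(N/M)^3)) expression from above.

```python
def cost(n, m):
    """Operation count estimate T(N*(N*M + (N/M)^3)) for region size m."""
    k = n // m  # number of regions per line, N = M.K
    return n * (n * m + k ** 3)

n = 1024
candidates = [8, 16, 32, 64, 128, 256]  # region sizes that divide n
best_m = min(candidates, key=lambda m: cost(n, m))
```

For N = 1024 the balance point M = sqrt(N) = 32 gives the lowest cost among these candidates, and both terms contribute equally there, so the total is 2*N^2.5. Setting the two terms equal is a heuristic: the exact minimum of the continuous expression sits at a slightly larger M by a constant factor, which does not change the asymptotic result.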

So the whole thing converts to:

T(N*(N*M+(N/M)^3))
T(N*(N*N^0.5+(N/N^0.5)^3))
T(N^2.5+N^2.5)
O(N^2.5)

Which is faster than the naive O(N^4). But you're right, the CPU has fewer operations to do, O(N^2), for this and does not require a copy of the data or multiple passes, so you should find out the threshold resolution on the specific HW for your task and choose depending on the measurements. PS: I hope I did not make a silly mistake somewhere in the computations. Also, if you do the H and V lines separately on the CPU, then the CPU-side complexity will be O(N^3), and with this approach even O(N^2.5), without the need for 2 passes per axis.

Take a look at this similar QA:

I think it is a good starting point.
