What is the most efficient way to implement a convolution filter within a pixel shader?

Question

Implementing convolution in a pixel shader is somewhat costly because of the very high number of texture fetches.

A direct way of implementing a convolution filter is to make N x N lookups per fragment using two nested for loops. A simple calculation shows that blurring a 1024x1024 image with a 4x4 Gaussian kernel would need 1024 x 1024 x 4 x 4 = 16M lookups.
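
To make the count concrete, here is a minimal Python sketch of the direct approach, run on the CPU rather than in a shader. The function name `convolve_direct`, the clamp-to-edge border handling, the 4x4 box kernel and the 8x8 test image are all assumptions made for illustration; none of them come from the question itself.

```python
def convolve_direct(image, kernel):
    """Direct N x N filter: returns (output, total number of lookups).

    The kernel is applied without flipping, which makes no difference
    for symmetric kernels such as a box or a Gaussian.
    """
    height, width = len(image), len(image[0])
    n = len(kernel)
    half = n // 2
    lookups = 0
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            # The two nested loops: N x N lookups for every output pixel.
            for ky in range(n):
                for kx in range(n):
                    # Clamp-to-edge addressing (assumed border mode).
                    sy = min(max(y + ky - half, 0), height - 1)
                    sx = min(max(x + kx - half, 0), width - 1)
                    acc += image[sy][sx] * kernel[ky][kx]
                    lookups += 1
            out[y][x] = acc
    return out, lookups


image = [[float((x + y) % 7) for x in range(8)] for y in range(8)]
box = [[1.0 / 16.0] * 4 for _ in range(4)]  # 4x4 box kernel, weights sum to 1
blurred, lookups = convolve_direct(image, box)
print(lookups)               # 8 * 8 * 4 * 4 = 1024
print(1024 * 1024 * 4 * 4)   # 16777216, the "16M" from the text
```

The lookup count grows with the image area times the kernel area, which is why larger kernels become expensive quickly.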

What can be done about this?

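  1. Can one use some kind of optimization that needs fewer lookups? I am not interested in kernel-specific optimizations like the ones for the Gaussian (or are they kernel-specific?)
  2. Can one at least make these lookups faster by somehow exploiting the locality of the pixels being used?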

Thanks!

Recommended answer

Gaussian kernels are separable, which means you can do a horizontal pass first, then a vertical pass (or the other way around). That reduces the cost per pixel from N^2 lookups to 2N. This works for all separable filters, not just for blur (not all filters are separable, but many are, and some are "as good as").
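
As a rough CPU-side illustration of the two-pass idea, the Python sketch below checks that a horizontal pass followed by a vertical pass gives the same result as the single-pass 2D filter. The helper names (`gaussian_1d`, `texel`, `blur_direct`, `blur_pass`, `blur_separable`), the odd kernel width of 5, the sigma value and the clamp-to-edge border mode are assumptions chosen for the example, not details from the answer.

```python
import math


def gaussian_1d(radius, sigma):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    raw = [math.exp(-(i * i) / (2.0 * sigma * sigma))
           for i in range(-radius, radius + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def texel(image, x, y):
    """Clamp-to-edge lookup (an assumed border mode)."""
    height, width = len(image), len(image[0])
    return image[max(0, min(y, height - 1))][max(0, min(x, width - 1))]


def blur_direct(image, k):
    """Single pass: the 2D kernel is the outer product k[i] * k[j]."""
    r = len(k) // 2
    return [[sum(k[i + r] * k[j + r] * texel(image, x + j, y + i)
                 for i in range(-r, r + 1) for j in range(-r, r + 1))
             for x in range(len(image[0]))] for y in range(len(image))]


def blur_pass(image, k, dx, dy):
    """One 1D pass along the direction (dx, dy)."""
    r = len(k) // 2
    return [[sum(k[j + r] * texel(image, x + j * dx, y + j * dy)
                 for j in range(-r, r + 1))
             for x in range(len(image[0]))] for y in range(len(image))]


def blur_separable(image, k):
    """Horizontal pass first, then a vertical pass over its result."""
    return blur_pass(blur_pass(image, k, 1, 0), k, 0, 1)


image = [[float((3 * x + 5 * y) % 11) for x in range(8)] for y in range(6)]
k = gaussian_1d(radius=2, sigma=1.0)
direct = blur_direct(image, k)
two_pass = blur_separable(image, k)
max_diff = max(abs(direct[y][x] - two_pass[y][x])
               for y in range(6) for x in range(8))
print(max_diff)                       # tiny, floating-point rounding only
print(len(k) ** 2, "vs", 2 * len(k))  # 25 vs 10 lookups per pixel
```

In a shader the two passes would typically render into an intermediate texture between them; that cost is not modeled here.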

Or, in the particular case of blur filters (Gaussian or not), which are all some kind of "weighted sum", you can take advantage of texture interpolation, which may be faster for small kernel sizes (but definitely not for large kernel sizes).

[Image: illustration of the "linear interpolation" method]

EDIT (as requested by Jerry Coffin) to summarize the comments:

In the "texture filter" method, linear interpolation will produce a weighted sum of adjacent texels according to the inverse distance from the sample location to the texel center. This is done by the texturing hardware, for free. That way, 16 pixels can be summed in 4 fetches. Texture filtering can be exploited in addition to separating the kernel.

In the example image, on the top left, your sample (the circle) hits the center of a texel. What you get is the same as with "nearest" filtering: that texel's value. On the top right, you are exactly halfway between two texels, so you get the 50/50 average of them (pictured by the lighter shade of blue). On the bottom right, you sample in between 4 texels, but somewhat closer to the top left one. That gives you a weighted average of all 4, with the weight biased towards the top left one (darkest shade of blue).
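
The same trick can be sketched in one dimension. The Python below emulates linear texture filtering on the CPU and merges each pair of adjacent taps into a single interpolated fetch, so a 4-tap kernel needs 2 fetches (in 2D, the same pairing along both axes turns 16 texels into 4 fetches). The function names, the [1, 3, 3, 1] / 8 weights and the convention that texel `i` has its center at coordinate `i` are assumptions for illustration; real GPUs usually place texel centers at half-integer coordinates. The merge assumes an even number of taps, adjacent integer offsets and positive weights.

```python
import math


def fetch_nearest(row, i):
    """Point sample with clamp-to-edge addressing."""
    return row[max(0, min(i, len(row) - 1))]


def fetch_linear(row, u):
    """Emulated linear filtering: blend the two texels around u."""
    i0 = math.floor(u)
    f = u - i0
    return (1.0 - f) * fetch_nearest(row, i0) + f * fetch_nearest(row, i0 + 1)


def merge_taps(offsets, weights):
    """Turn each pair of adjacent taps into one (offset, weight) fetch."""
    merged = []
    for i in range(0, len(weights), 2):
        w = weights[i] + weights[i + 1]
        # Shift the sample towards the second texel in proportion to its weight.
        merged.append((offsets[i] + weights[i + 1] / w, w))
    return merged


def blur_point(row, x, offsets, weights):
    """Reference: one point fetch per tap."""
    return sum(w * fetch_nearest(row, x + o) for o, w in zip(offsets, weights))


def blur_linear(row, x, merged):
    """Half as many fetches, each one linearly filtered."""
    return sum(w * fetch_linear(row, x + o) for o, w in merged)


offsets = [-1, 0, 1, 2]
weights = [1 / 8, 3 / 8, 3 / 8, 1 / 8]
merged = merge_taps(offsets, weights)
row = [0.0, 8.0, 16.0, 0.0, 4.0, 12.0, 2.0, 6.0]
errors = [abs(blur_point(row, x, offsets, weights) - blur_linear(row, x, merged))
          for x in range(len(row))]
print(merged)       # two fetches replace four taps
print(max(errors))  # rounding error only
```

The merged weight is the sum of the pair and the merged offset is the weight-averaged position, which is exactly what the interpolation hardware computes "for free".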

The following suggestions are courtesy of datenwolf (see below):

"Another method I'd like to suggest is operating in Fourier space, where convolution turns into a simple product of the Fourier-transformed signal and the Fourier-transformed kernel. The Fourier transform itself is quite tedious to implement on the GPU, at least using OpenGL shaders, but it's quite easy to do in OpenCL. Actually, I implement such things using OpenCL; a lot of the image processing in my 3D engine now happens in OpenCL.

OpenCL has been specifically designed for running on GPUs. A Fast Fourier Transform is actually the example code in Wikipedia's OpenCL article (en.wikipedia.org/wiki/OpenCL), and yes, the performance gain is tremendous. An FFT executes in at most O(n log n), and the inverse transform costs the same. The Fourier representation of the filter kernel can be precomputed. The procedure is FFT -> multiply with kernel -> IFFT, which boils down to O(n + 2n log n) operations. Note that the actual convolution is just O(n) there.

In the case of a separable, finite convolution like a Gaussian blur, the separation solution will outperform the Fourier method. But for generalized, possibly non-separable kernels, the Fourier method is probably the fastest method available. OpenCL integrates nicely with OpenGL, e.g. you can use OpenGL buffers (textures and vertex) for both input and output of OpenCL programs."
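
The FFT -> multiply with kernel -> IFFT pipeline from the quote can be sketched in plain Python for a 1D signal. This is only a CPU illustration under stated assumptions: the recursive radix-2 `fft`, the helper names, the power-of-two signal length and the circular (wrap-around) border behavior are choices made for the example, not the OpenCL implementation the quote refers to. For images, the same transform would be applied along rows and then along columns.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(values):
    """Inverse transform: conjugate twiddles, then scale by 1/n."""
    n = len(values)
    return [v / n for v in fft(values, inverse=True)]


def convolve_fft(signal, kernel):
    """Circular convolution via FFT -> pointwise multiply -> IFFT."""
    padded = list(kernel) + [0.0] * (len(signal) - len(kernel))
    kernel_spectrum = fft(padded)  # could be precomputed once per kernel
    product = [a * b for a, b in zip(fft(signal), kernel_spectrum)]  # the O(n) step
    return [v.real for v in ifft(product)]


def convolve_circular(signal, kernel):
    """Reference: direct circular convolution, O(n * len(kernel))."""
    n = len(signal)
    return [sum(kernel[j] * signal[(i - j) % n] for j in range(len(kernel)))
            for i in range(n)]


signal = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0]
kernel = [0.25, 0.5, 0.25]
fast = convolve_fft(signal, kernel)
slow = convolve_circular(signal, kernel)
print(max(abs(a - b) for a, b in zip(fast, slow)))  # rounding error only
```

The pointwise multiplication in `convolve_fft` is the O(n) step mentioned in the quote; the two transforms account for the 2n log n term.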
