What is GPU driven rendering?


Question


Nowadays I'm hearing from different places about the so called GPU driven rendering which is a new paradigm of rendering which doesn't require draw calls at all, and that it is supported by the new versions of OpenGL and Vulkan APIs. Can someone explain how it actually works on conceptual level and what are the main differences with the traditional approach?

Solution

Overview

In order to render a scene, a number of things have to happen. You need to walk your scene graph to figure out which objects exist. For each object which exists, you now need to determine if it is visible. For each object which is visible, you need to figure out where its geometry is stored, which textures and buffers will be used to render that object, which shaders to use to render the object, and so forth. Then you render that object.

The "traditional" method handling this is for the CPU to handle this process. The scene graph lives in CPU-accessible memory. The CPU does visibility culling on that scene graph. The CPU takes the visible objects and access some CPU data about the geometry (OpenGL buffer object and texture names, Vulkan descriptor sets and VkBuffers, etc), shaders, etc, transferring this as state data to the GPU. Then the CPU issues a GPU command to render that object with that state.

Now, if we go back farther, the most "traditional" method doesn't involve a GPU at all. The CPU would just take this mesh and texture data, do vertex transformations, rasterization, and so forth, producing an image in CPU memory. However, we started off-loading some of this to a separate processor. We started with the rasterization stuff (the earliest graphics chips were just rasterizers; the CPU did all the vertex T&L). Then we incorporated the vertex transformations into the GPU. When we did that, we started having to store vertex data in GPU accessible memory so the GPU could read it on its own time.

We did all of that, off-loading these things to a separate processor, for two reasons: the GPU was (much) faster at it, and the CPU could now spend its time doing something else.

GPU driven rendering is just the next stage in that process. We went from no GPU, to rasterization GPU, to vertex GPU, and now to scene-graph-level GPU. The "traditional" method offloads how to render to the GPU; GPU driven rendering offloads the decision of what to render.

Mechanism

Now, the reason we haven't been doing this all along is because the basic rendering commands all take data that comes from the CPU. glDrawArrays/Elements takes a number of parameters from the CPU. So even if we used the GPU to generate that data, we would need a full GPU/CPU synchronization so that the CPU could read the data... and give it right back to the GPU.

That's not helpful.

OpenGL 4 gave us indirect rendering of various forms. The basic idea is that, instead of taking those parameters from a function call, they're just data stored in GPU memory. The CPU still has to make a function call to start the rendering operation, but the actual parameters to that call are just data stored in GPU memory.
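To make this concrete, here is what one sub-draw record looks like for OpenGL's glMultiDrawElementsIndirect. The struct layout is dictated by the spec; the wrapper function and buffer name are just illustrative:

    #include <glad/gl.h> /* any loader exposing core GL 4.3+ */

    /* One sub-draw's parameters, living in GPU memory. */
    typedef struct {
        GLuint count;         /* number of indices for this sub-draw   */
        GLuint instanceCount; /* 0 makes the sub-draw do nothing       */
        GLuint firstIndex;    /* offset into the bound index buffer    */
        GLint  baseVertex;    /* offset added to every index           */
        GLuint baseInstance;  /* visible to shaders as gl_BaseInstance */
    } DrawElementsIndirectCommand;

    /* The CPU call carries no per-object parameters; it just points at an
       array of these records in the buffer bound to GL_DRAW_INDIRECT_BUFFER. */
    void submit(GLuint indirect_buffer, GLsizei draw_count)
    {
        glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirect_buffer);
        glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                    (const void *)0, draw_count, 0);
    }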

The other half of that requires the ability of the GPU to write data to GPU memory in a format that indirect rendering can read. Historically, data on GPUs goes in one direction: data gets read for the purpose of being converted into pixels in a render target. We need a way to generate semi-arbitrary data from other arbitrary data, all on the GPU.

The older mechanism for this was to (ab)use transform feedback for this purpose, but nowadays we just use SSBOs or, failing that, image load/store. Compute shaders help here as well, since they are designed to be outside of the standard rendering pipeline and therefore are not bound to its limitations.

The ideal form of GPU-driven rendering makes the scene-graph part of the rendering operation. There are lesser forms, such as having the GPU do nothing more than per-object viewport culling. But let's look at the most ideal process. From the perspective of the CPU, this looks like the following (a code sketch of this loop follows the list):

  1. Update the scene graph in GPU memory.
  2. Issue one or more compute shaders that generate multi-draw indirect rendering commands.
  3. Issue a single multi-draw indirect call that draws everything.
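A minimal sketch of that per-frame CPU side in GL 4.6 might look like this. Everything named here (cull_program, scene_ssbo, and so on), the SSBO binding points, and the workgroup size of 64 are assumptions about a hypothetical engine; the GL calls themselves are standard:

    /* Assumes a VAO with the shared vertex/index buffers is already bound. */
    void render_frame(GLuint cull_program, GLuint scene_ssbo, GLuint cmd_buffer,
                      GLuint object_ssbo, GLuint count_buffer, GLsizei max_draws)
    {
        /* 1. Scene-graph edits were already uploaded into scene_ssbo;
              reset the visible-draw counter to zero. */
        const GLuint zero = 0;
        glClearNamedBufferData(count_buffer, GL_R32UI, GL_RED_INTEGER,
                               GL_UNSIGNED_INT, &zero);

        /* 2. One CS invocation per potential object: cull, then append a
              command (and per-object data) for each survivor. */
        glUseProgram(cull_program);
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, scene_ssbo);
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, cmd_buffer);
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, object_ssbo);
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, count_buffer);
        glDispatchCompute(((GLuint)max_draws + 63u) / 64u, 1, 1);

        /* Make the CS writes visible to the indirect-draw machinery. */
        glMemoryBarrier(GL_COMMAND_BARRIER_BIT | GL_SHADER_STORAGE_BARRIER_BIT);

        /* 3. One CPU call draws everything; the real sub-draw count is read
              from count_buffer by the GPU, never by the CPU. */
        glBindBuffer(GL_DRAW_INDIRECT_BUFFER, cmd_buffer);
        glBindBuffer(GL_PARAMETER_BUFFER, count_buffer);
        glMultiDrawElementsIndirectCount(GL_TRIANGLES, GL_UNSIGNED_INT,
                                         (const void *)0, 0, max_draws, 0);
    }

(On GL 4.3-4.5, without glMultiDrawElementsIndirectCount, one alternative is to draw max_draws commands every frame and have the CS write instanceCount = 0 into the slots of culled objects.)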

Now of course, there's no such thing as a free lunch. Doing full scene graph processing on the GPU requires building your scene graph in a way that is efficient for GPU processing. Even more importantly, visibility culling mechanisms have to be engineered with efficient GPU processing in mind. That's complexity I'm not going to address here.

Implementation

Instead, let's look at the nuts-and-bolts of making the drawing part work. We have to sort out a lot of things here.

See, the indirect rendering command is still a regular old rendering command. While the multi-draw form draws multiple distinct "objects", it's still one CPU rendering command. This means that, for the duration of this command, all rendering state is fixed.

So everything under the purview of this multi-draw operation must use the same shader, bound buffers and textures, blending parameters, stencil state, and so forth. This makes implementing a GPU-driven rendering operation a bit complicated.

State and Shaders

If you need blending, or similar state-based differences in rendering operations, then you are going to have to issue another rendering command. So in the blending case, your scene-graph processing is going to have to compute multiple sets of rendering commands, with each set being for a specific set of blending modes. You may also need to have this system sort transparent objects (unless you're rendering them with an OIT mechanism). So instead of having just one rendering command, you have a small number of them.
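One way to realize "one command set per state bucket" inside the culling shader is to keep a separate atomic draw counter per bucket, with each bucket owning a fixed slice of one big command array. A GLSL fragment of that idea, with every name and capacity invented for illustration:

    struct DrawCommand {
        uint count; uint instanceCount; uint firstIndex;
        int baseVertex; uint baseInstance;
    };

    const uint MAX_PER_BUCKET = 4096u; // illustrative per-bucket capacity

    layout(std430, binding = 4) buffer BucketCounts { uint bucket_count[]; };
    layout(std430, binding = 5) writeonly buffer BucketCmds { DrawCommand bucket_cmds[]; };

    void emit_draw(uint bucket, DrawCommand cmd)
    {
        uint slot = atomicAdd(bucket_count[bucket], 1u);
        bucket_cmds[bucket * MAX_PER_BUCKET + slot] = cmd;
    }

The CPU then issues one multi-draw per bucket, setting that bucket's blend state before each one.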

But the point of this exercise isn't to have only one rendering command; the point is that the number of CPU rendering commands does not change with regard to how much stuff you're rendering. It shouldn't matter how many objects are in the scene; the CPU will be issuing the same number of rendering commands.

When it comes to shaders, this technique requires some degree of "ubershader" style: where you have a very small number of rather flexible shaders. You want to parameterize your shader rather than having dozens or hundreds of them.

However, things were probably going to fall out that way anyway, particularly with regard to deferred rendering. The geometry pass of deferred renderers tends to use the same kind of processing, since they're just doing vertex transformation and extracting material parameters. The biggest difference usually is with regard to doing skinned vs. non-skinned rendering, but that's really only 2 shader variations, which you can handle similarly to the blending case.

Speaking of deferred rendering, the GPU driven processes can also walk the graph of lights, thus generating the draw calls and rendering data for the lighting passes. So while the lighting pass will need a separate draw call, it will still only need a single multidraw call regardless of the number of lights.

Buffers

Here's where things start to get interesting. See, if the GPU is processing the scene graph, that means that the GPU needs to somehow associate a particular draw within the multi-draw command with the resources that particular draw needs. It may also need to put the data into those resources, like the matrix transforms for a given object and so forth.

Oh, and you also somehow need to tie the vertex input data to a particular sub-draw.

That last part is probably the most complicated. The buffers which OpenGL/Vulkan's standard vertex input method pulls from are state data; they cannot change between sub-draws of a multi-draw operation.

Your best bet is to try to put every object's data in the same buffer object, using the same vertex format. Essentially, you have one gigantic array of vertex data. You can then use the drawing parameters for the sub-draw to select which parts of the buffer(s) to use.
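Concretely, each mesh then reduces to a record of offsets into those shared buffers, which the command generator later copies into the matching fields of the indirect command. A hypothetical bookkeeping struct (not an API type):

    #include <stdint.h>

    /* Where one mesh lives inside the shared vertex/index buffers. */
    typedef struct {
        uint32_t first_index; /* start of its indices in the big index buffer   */
        uint32_t index_count; /* how many indices it uses                       */
        int32_t  base_vertex; /* start of its vertices in the big vertex buffer */
    } MeshRange;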

But what do we do about per-object data (matrices, etc), things you would typically use a UBO or global uniform for? How do you effectively change the buffer binding state within a CPU rendering command?

Well... you can't. So you cheat.

First, you realize that SSBOs can be arbitrarily large. So you don't really need to change buffer binding state. What you need is a single SSBO that contains everyone's per-object data. For each vertex, the VS simply needs to pick out the correct data for that sub-draw from the giant list of data.

This is done via a special vertex shader input: gl_DrawID. When you issue a multi-draw command, the VS gets an input value that represents the index of this sub-draw operation within the multidraw command. So you can use gl_DrawID to index into a table of per-object data to fetch the appropriate data for that particular object.
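A sketch of such a vertex shader, assuming GL 4.6 (where gl_DrawID is core; older hardware needs ARB_shader_draw_parameters). The block layout and member names are assumptions, chosen to match the CPU-side sketch earlier:

    #version 460 core

    layout(location = 0) in vec3 position;

    struct PerObject {
        mat4 model_matrix;   // written by the command-generating CS
        uint material_index; // used later to find textures
        uint _pad0, _pad1, _pad2;
    };

    layout(std430, binding = 2) readonly buffer Objects { PerObject objects[]; };

    layout(location = 0) uniform mat4 view_proj;

    void main()
    {
        // gl_DrawID is the index of this sub-draw within the multi-draw,
        // which is exactly the slot the compute shader wrote our data into.
        PerObject obj = objects[gl_DrawID];
        gl_Position = view_proj * obj.model_matrix * vec4(position, 1.0);
    }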

This also means that the compute shader which generates this sub-draw also needs to use the index of that sub-draw to define where in the array to put the per-object data for that sub-draw. So the CS that writes a sub-draw also needs to be responsible for setting up the per-object data that matches the sub-draw.
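Putting those pieces together, a command-generating compute shader could look like the sketch below. The scene layout, the sphere-frustum test, and all names are assumptions about a hypothetical engine; the load-bearing part is the single atomicAdd handing out matching slots in both the command array and the per-object array:

    #version 460 core
    layout(local_size_x = 64) in;

    struct DrawCommand {
        uint count; uint instanceCount; uint firstIndex;
        int baseVertex; uint baseInstance;
    };
    struct SceneObject {
        mat4 model_matrix;
        vec4 bounding_sphere; // xyz = center (world space), w = radius
        uint first_index; uint index_count; int base_vertex;
        uint material_index;
    };
    struct PerObject {
        mat4 model_matrix;
        uint material_index;
        uint _pad0, _pad1, _pad2;
    };

    layout(std430, binding = 0) readonly buffer Scene    { SceneObject scene[]; };
    layout(std430, binding = 1) writeonly buffer Cmds    { DrawCommand cmds[]; };
    layout(std430, binding = 2) writeonly buffer Objects { PerObject objects[]; };
    layout(std430, binding = 3) buffer Count             { uint draw_count; };

    layout(location = 0) uniform vec4 frustum_planes[6];

    bool visible(vec4 sphere)
    {
        for (int i = 0; i < 6; ++i)
            if (dot(frustum_planes[i].xyz, sphere.xyz) + frustum_planes[i].w
                    < -sphere.w)
                return false;
        return true;
    }

    void main()
    {
        uint id = gl_GlobalInvocationID.x;
        if (id >= uint(scene.length())) return;

        SceneObject src = scene[id];
        if (!visible(src.bounding_sphere)) return;

        // One atomic yields both the sub-draw slot and the per-object slot,
        // so gl_DrawID in the VS lands on the matching data automatically.
        uint slot = atomicAdd(draw_count, 1u);
        cmds[slot] = DrawCommand(src.index_count, 1u, src.first_index,
                                 src.base_vertex, 0u);
        objects[slot] = PerObject(src.model_matrix, src.material_index,
                                  0u, 0u, 0u);
    }

The CPU-side sketch earlier clears draw_count each frame and feeds it back as the GPU-side draw count for glMultiDrawElementsIndirectCount.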

Textures

OpenGL and Vulkan have pretty strict limits on the number of textures that can be bound. Well actually those limits are quite large relative to traditional rendering, but in GPU driven rendering land, we need a single CPU rendering call to potentially access any texture. That's harder.

Now, we do have gl_DrawID; coupled with the table mentioned above, we can retrieve per-object data. So: how do we convert this to a texture?

There are multiple ways. We could put a bunch of our 2D textures into an array texture. We can then use gl_DrawID to fetch an array index from our SSBO of per-object data; that array index becomes the array layer we use to fetch "our" texture. Note that we don't use gl_DrawID directly because multiple different sub-draws could use the same texture, and because the GPU code that sets up the array of draw calls does not control the order in which textures appear in our array.
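In the shader, the array-texture approach then looks something like this fragment-shader sketch, assuming the vertex shader fetched the layer index from the per-object SSBO via gl_DrawID and forwarded it as a flat input:

    #version 460 core

    layout(binding = 0) uniform sampler2DArray material_textures;

    layout(location = 0) flat in uint material_layer; // from the VS, via gl_DrawID
    layout(location = 1) in vec2 uv;
    layout(location = 0) out vec4 frag_color;

    void main()
    {
        // The layer came from per-object data rather than gl_DrawID itself,
        // since several sub-draws may share one texture.
        frag_color = texture(material_textures, vec3(uv, float(material_layer)));
    }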

Array textures have obvious downsides, the most notable of which is that we must respect the limitations of an array texture. All elements in the array must use the same image format. They must all be of the same size. Also, there are limits on the number of array layers in an array texture, so you might encounter them.

The alternatives to array textures differ along API lines, though they basically boil down to the same thing: convert a number into a texture.

In OpenGL land, you can employ bindless texturing (for hardware that supports it). This system provides a mechanism that allows one to generate a 64-bit integer handle which represents a particular texture, pass this handle to the GPU (since it is just an integer, use whatever mechanism you want), and then convert this 64-bit handle into a sampler type. So you use gl_DrawID to fetch a 64-bit handle from the per-object data, then convert that into a sampler of the appropriate type and use it.
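In GLSL terms, that conversion really is just a constructor call once the extension is enabled. A sketch, where the SSBO layout and names are assumptions:

    #version 460 core
    #extension GL_ARB_bindless_texture : require

    struct PerObject {
        uvec2 albedo_handle; // 64-bit bindless handle, packed as two uints
        uint material_index;
        uint _pad0;
    };

    layout(std430, binding = 2) readonly buffer Objects { PerObject objects[]; };

    layout(location = 0) flat in uint draw_id; // forwarded gl_DrawID from the VS
    layout(location = 1) in vec2 uv;
    layout(location = 0) out vec4 frag_color;

    void main()
    {
        // ARB_bindless_texture permits constructing a sampler from a handle.
        sampler2D albedo = sampler2D(objects[draw_id].albedo_handle);
        frag_color = texture(albedo, uv);
    }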

In Vulkan land, you can employ sampler arrays (for hardware that supports it). Note that these are not array textures; in GLSL, these are sampler types which are arrayed: uniform sampler2D my_2D_textures[6000];. In OpenGL, this would be a compile error because each array element represents a distinct bind point for a texture, and you cannot have 6000 distinct bind points. In Vulkan, an arrayed sampler only represents a single descriptor, no matter how many elements are in that array. Vulkan implementations have limits on how many elements there can be in such arrays, but hardware that supports the feature you need to employ this (shaderSampledImageArrayDynamicIndexing) will typically offer a generous limit.

So your shader uses gl_DrawID to get an index from the per-object data. The index is turned into a sampler by just fetching the value from the sampler array. The only limitation for textures in that arrayed descriptor is that they must all be of the same type and basic data format (floating-point 2D for sampler2D, unsigned integer cubemap for usamplerCube, etc). The specifics of formats, texture sizes, mipmap counts, and the like are all irrelevant.
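A Vulkan-GLSL sketch of the same idea: one arrayed descriptor plus the per-object index table. gl_DrawID is dynamically uniform, so plain dynamic indexing suffices; the set/binding layout and the array size of 4096 are assumptions:

    #version 460 core

    // One descriptor, many textures.
    layout(set = 0, binding = 0) uniform sampler2D material_textures[4096];

    layout(set = 0, binding = 1, std430) readonly buffer Objects {
        uint texture_index[]; // indexed by sub-draw, filled by the culling CS
    };

    layout(location = 0) flat in uint draw_id; // forwarded gl_DrawID (shaderDrawParameters)
    layout(location = 1) in vec2 uv;
    layout(location = 0) out vec4 frag_color;

    void main()
    {
        frag_color = texture(material_textures[texture_index[draw_id]], uv);
    }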

And if you're concerned about the cost difference of Vulkan's array of samplers compared to OpenGL's bindless, don't be; implementations of bindless are just doing this behind your back anyway.
