What is the correct way to calculate the FPS given that GPUs have a task queue and are asynchronous?


Question


I always assumed that the correct way to calculate the FPS was to simply time how long it took to do an iteration of your draw loop. And much of the internet seems to be in accordance.

But!

Modern graphics cards are treated as asynchronous servers, so the draw loop sends out drawing instructions for vertex/texture/etc. data already on the GPU. These calls do not block the calling thread until the request on the GPU completes; they are simply added to the GPU's task queue. So surely the 'traditional' (and rather ubiquitous) method is just measuring the call dispatch time?
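For concreteness, this is the kind of loop I mean; a minimal sketch assuming GLFW, with drawScene() standing in for whatever rendering code you have:

```cpp
#include <GLFW/glfw3.h>
#include <cstdio>

void drawScene();  // placeholder for your own rendering code

// "Traditional" FPS measurement: time one iteration of the draw loop on the
// CPU. The GL calls inside drawScene() mostly just enqueue work for the GPU,
// so this measures submission time, not when the frame reaches the screen.
void renderLoop(GLFWwindow* window) {
    double previous = glfwGetTime();               // seconds since glfwInit()
    while (!glfwWindowShouldClose(window)) {
        drawScene();                               // enqueue draw calls
        glfwSwapBuffers(window);
        glfwPollEvents();

        double now = glfwGetTime();
        double frameTime = now - previous;         // CPU-side frame time only
        previous = now;
        std::printf("%.3f ms -> %.1f \"fps\"\n",
                    frameTime * 1000.0, 1.0 / frameTime);
    }
}
```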

What prompted me to ask was that I had implemented the traditional method and it consistently gave absurdly high framerates, even when what was being rendered caused the animation to become choppy. Re-reading my OpenGL SuperBible brought me to glGenQueries, which lets me time sections of the rendering pipeline.

To summarise, is the 'traditional' way of calculating FPS totally defunct with (barely) modern graphics cards? If so, why are GPU profiling techniques relatively unknown?

Solution

Measuring fps is hard. It's made harder by the fact that various people who want to measure fps don't necessarily want to measure the same thing. So ask yourself this. Why do you want an fps number?

Before I go on and dive into all the pitfalls and potential solutions, I do want to point out that this is by no means a problem specific to "modern graphics cards". If anything, it used to be way worse, with SGI-type machines where the rendering actually happened on a graphics subsystem that could be remote to the client (as in, physically remote). GL 1.0 was actually defined in terms of client-server.

Anyways. Back to the problem at hand.

fps, meaning frames per second, is really trying to convey, in a single number, a rough idea of the performance of your application, a number that can be related directly to things like the screen refresh rate. For a first-level approximation of performance, it does an OK job. It breaks down completely as soon as you want to do more fine-grained analysis.

The problem is really that the thing that matters most for the "feeling of smoothness" of an application is when the picture you drew ends up on the screen. The secondary thing, which also matters quite a bit, is how long it takes between the time you trigger an action and when its effect shows up on screen (the total latency).

As an application draws a series of frames, it submits them at times s0, s1, s2, s3,... and they end up showing on screen at t0, t1, t2, t3,...

To feel smooth you need all the following things:

  1. t(n) - s(n) is not too high (latency)
  2. t(n+1)-t(n) is small (under 30ms)
  3. there is also a hard constraint on the simulation delta time, which I'll talk about later.

When you measure the CPU time for your rendering, you end up measuring s1-s0 to approximate t1-t0. As it turns out, this is on average not far from the truth, as client code will never get "too far ahead" (this assumes you're rendering frames all the time, though; see below for other cases). What actually happens is that the GL will end up blocking the CPU (typically at SwapBuffers time) when it tries to get too far ahead. That blocking time is roughly the extra time taken by the GPU compared to the CPU on a single frame.
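You can observe that blocking by timing the swap call separately from the submission; a sketch with the same assumptions as before (GLFW, a placeholder drawScene()). Where exactly the driver chooses to block is implementation dependent, so treat the split as indicative only:

```cpp
#include <GLFW/glfw3.h>
#include <cstdio>

void drawScene();  // placeholder for your own rendering code

// Split one frame into "dispatch" time (enqueueing GL commands) and the time
// spent blocked in the swap while the GPU catches up.
void timedFrame(GLFWwindow* window) {
    double t0 = glfwGetTime();
    drawScene();                       // mostly non-blocking command submission
    double t1 = glfwGetTime();
    glfwSwapBuffers(window);           // driver often blocks here if the GPU is behind
    double t2 = glfwGetTime();

    std::printf("dispatch %.2f ms, blocked in swap %.2f ms\n",
                (t1 - t0) * 1000.0, (t2 - t1) * 1000.0);
}
```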

If you really want to measure t1-t0, as you mentioned in your own post, queries are closer to it. But... things are never really that simple. The first problem is that if you're CPU bound (meaning your CPU is not quick enough to always provide work to the GPU), then a part of the time t1-t0 is actually idle GPU time. That won't get captured by a query. The next problem you hit is that, depending on your environment (display compositing environment, vsync), queries may actually only measure the time your application spends rendering to a back buffer, which is not the full rendering time (as the display has not been updated at that point). It does give you a rough idea of how long your rendering takes, but it will not be precise either. Further note that queries are also subject to the asynchronicity of the graphics part. So if your GPU is idle part of the time, the query may miss that part. (E.g. say your CPU takes very long (100ms) to submit your frame, and the GPU then executes the full frame in 10ms. Your query will likely report 10ms, even though the total processing time was closer to 100ms...)
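For reference, a minimal sketch of such a timer query, assuming a GL 3.3 context (or ARB_timer_query) and a loader such as glad; drawScene() is again a placeholder:

```cpp
#include <glad/glad.h>   // or whichever GL loader you use
#include <cstdio>

void drawScene();        // placeholder for your own rendering code

// Measure the GPU time spent on the commands between Begin/End. GPU idle gaps
// and the final present/compositing step are not included in this number.
void gpuTimedFrame(GLuint query) {
    glBeginQuery(GL_TIME_ELAPSED, query);
    drawScene();
    glEndQuery(GL_TIME_ELAPSED);

    // Reading the result right away stalls until the GPU is done; real code
    // would read it a frame or two later, or rotate several query objects.
    GLuint64 gpuTimeNs = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &gpuTimeNs);
    std::printf("GPU time: %.3f ms\n", gpuTimeNs / 1.0e6);
}

// One-time setup:  GLuint q; glGenQueries(1, &q);
```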

Now, with respect to "event-based rendering" as opposed to the continuous rendering I've discussed so far: fps for those types of workloads doesn't make much sense, as the goal is not to draw as many frames per second as possible. There the natural metric for GPU performance is ms per frame. That said, it is only a small part of the picture. What really matters there is the time between the moment you decided you wanted to update the screen and the moment it actually happened. Unfortunately, that number is hard to find: it typically starts when you receive an event that triggers the process and ends when the screen is updated (something you can only measure with a camera capturing the screen output...).

The problem is that between the two, you have potential overlap between the CPU and GPU processing, or not (or even some delay between the time the CPU stops submitting commands and the GPU starts executing them). And that is completely up to the implementation to decide. The best you can do is to call glFinish at the end of the rendering to know for sure the GPU is done processing the commands you sent, and measure the time on the CPU. That solution does reduce the overall performance of the CPU side, and potentially the GPU side as well, if you were going to submit the next event right after...
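A sketch of that glFinish-based measurement, under the same assumptions as the earlier snippets (glad + GLFW, placeholder drawScene()); note it deliberately gives up CPU/GPU overlap and still excludes the final scan-out/compositor step:

```cpp
#include <glad/glad.h>
#include <GLFW/glfw3.h>
#include <cstdio>

void drawScene();  // placeholder for your own rendering code

// Bracket the whole response to an event with a CPU timer and a glFinish(),
// so the measurement ends only once the GPU has drained the command queue.
void handleRedrawEvent(GLFWwindow* window) {
    double start = glfwGetTime();

    drawScene();
    glfwSwapBuffers(window);
    glFinish();                        // wait for the GPU to finish everything queued

    double elapsedMs = (glfwGetTime() - start) * 1000.0;
    std::printf("event -> GPU finished: %.2f ms\n", elapsedMs);
}
```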

Lastly, the discussion about the "hard constraint on simulation delta time":

A typical animation uses a delta time between frames to move the animation forward. The major problem is that for a fully smooth animation, you really want the delta time you use when submitting your frame at s1 to be t1-t0 (so that when t1 shows, the time that actually elapsed since the previous frame was indeed t1-t0). The problem of course is that you have no idea what t1-t0 is at the time you submit s1... So you typically use an approximation. Many just use s1-s0, but that can break down - e.g. SLI-type systems can have some delays in AFR rendering between the various GPUs. You could also try to use an approximation of t1-t0 (or more likely t0-t(-1)) through queries. The result of getting this wrong is most likely micro-stuttering on SLI systems.

The most robust solution is to say "lock to 30fps, and always use 1/30s". It's also the one that allows the least leeway on content and hardware, as you have to ensure your rendering can indeed be done in those 33ms... But it is what some console developers choose to do (fixed hardware makes it somewhat simpler).
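One way to express that fixed-step approach, as a sketch (GLFW again; simulate() and drawScene() are placeholders, and the swap interval of 2 assumes a 60 Hz display):

```cpp
#include <GLFW/glfw3.h>

void simulate(double dt);   // placeholder: advance the animation by dt seconds
void drawScene();           // placeholder: issue the GL draw calls

// Fixed-timestep loop: the simulation always advances by exactly 1/30 s per
// displayed frame, and the rendering must reliably fit in those 33 ms.
// Pacing is delegated to vsync (swap interval 2 on a 60 Hz display); a real
// implementation would also detect and handle missed frames.
void fixedStepLoop(GLFWwindow* window) {
    const double kStep = 1.0 / 30.0;
    glfwSwapInterval(2);               // requires the context to be current
    while (!glfwWindowShouldClose(window)) {
        simulate(kStep);               // constant delta, independent of measured time
        drawScene();
        glfwSwapBuffers(window);
        glfwPollEvents();
    }
}
```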
