After upgrading to 1.2.162.1: vkQueueWaitIdle == VK_ERROR_DEVICE_LOST

Question

I recently upgraded my ray tracing renderer from Vulkan SDK version 1.2.148.0 to 1.2.162.1. This was necessary because the ray tracing extensions left beta and therefore now work with non-beta graphics drivers (I am on version 461.40 for my RTX 2070 SUPER). The upgrade required quite a few changes to the ray tracing side of my renderer, which I managed thanks to the NVIDIA tutorial.

Unfortunately, code that used to work now causes errors. In many situations, submitting a single-time command buffer causes vkQueueWaitIdle to fail with VK_ERROR_DEVICE_LOST, which in turn results in a validation error saying I'm trying to free the command buffer while it's still in use. This happens for a variety of uses: transitioning an image layout (undefined to general, it seems), building acceleration structures, and copying buffers, though not every time (e.g. from a staging buffer to a device buffer, after which freeing the staging buffer also throws an error because the copy has not finished and the buffer is still in use). For other uses, it works fine. I can't currently identify a common denominator.

Finally, the program crashes because presenting the first frame fails: its layout is undefined. I assume this is caused by one or more of the previously mentioned errors.

Did something change about this since I last used it? This is the offending code (endSingleTimeCommands):

    vkEndCommandBuffer(commandBuffer);

    VkSubmitInfo submitInfo{};
    submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.commandBufferCount = 1;
    submitInfo.pCommandBuffers = &commandBuffer;

    vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
    switch (vkQueueWaitIdle(graphicsQueue)) {
        //debug output removed for brevity
    };

    vkFreeCommandBuffers(device, commandPool, 1, &commandBuffer);

One of the places where it fails is:

    //[fill the structs with info...]

    //function pointer grabbed via vkGetDeviceProcAddr
    vk::vkCmdBuildAccelerationStructuresKHR(cmd, 1, &buildInfo, &buildOffset);

    //[call to the above code here]

But code unrelated to the extensions also fails (sometimes!), such as this:

    VkCommandBuffer commandBuffer = beginSingleTimeCommands();

    VkBufferCopy copyRegion{};
    copyRegion.srcOffset = 0; // Optional
    copyRegion.dstOffset = 0; // Optional
    copyRegion.size = size;
    vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);

    endSingleTimeCommands(commandBuffer);

Perhaps beginSingleTimeCommands is also relevant:

    VkCommandBufferAllocateInfo allocInfo{};
    allocInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
    allocInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
    allocInfo.commandPool = commandPool;
    allocInfo.commandBufferCount = 1;

    VkCommandBuffer commandBuffer;
    if (vkAllocateCommandBuffers(device, &allocInfo, &commandBuffer) != VK_SUCCESS) {
        std::cout << "beginSingleTimeCommands: could not allocate command buffer!\n";
    }

    VkCommandBufferBeginInfo beginInfo{};
    beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    beginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;

    if (vkBeginCommandBuffer(commandBuffer, &beginInfo) != VK_SUCCESS) {
        std::cout << "beginSingleTimeCommands: could not begin command buffer!\n";
    }

    return commandBuffer;

Some additional info I've gathered: I used the NVIDIA pipeline checkpoint system to add a checkpoint before and after the call to vkCmdBuildAccelerationStructuresKHR, and both checkpoints are at TOP_OF_PIPE. After the first call to this function, no more checkpoint output is generated, which leads me to believe that the first build call somehow ruins everything. I will triple-check my AS building and get back to you if I find anything.

Recommended answer

It turns out the actual error can occur before the command buffer whose vkQueueWaitIdle returns the DEVICE_LOST error. I've had, and continue to have, a variety of errors in my acceleration structure building code. I can't easily debug it because the validation layers apparently don't report subtle mistakes in the structs fed to vkCmdBuildAccelerationStructuresKHR; instead, it takes a lot of trial and error.

One notable example, which I'm certain the validation layers would have caught before the upgrade, is forgetting to set the VkAccelerationStructureBuildGeometryInfoKHR::scratchData field. That was the last mistake I had to fix to finally get everything to run.
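Since the validation layers did not flag this for me, a small hand-written check before recording the build may help. Below is a minimal sketch, not the actual fix from my renderer. It assumes the build info struct was value-initialized, so an unset scratchData.deviceAddress reads as 0, and that the alignment passed in comes from VkPhysicalDeviceAccelerationStructurePropertiesKHR::minAccelerationStructureScratchOffsetAlignment, which I assume to be a power of two. The helper names scratchAddressLooksValid and alignUp are hypothetical, and std::uint64_t stands in for VkDeviceAddress so the snippet compiles without the Vulkan SDK.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if `address` passes two basic sanity checks for a
// scratch address: it is non-zero and a multiple of `alignment`.
// `alignment` is assumed to be a non-zero power of two.
inline bool scratchAddressLooksValid(std::uint64_t address, std::uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    if (address == 0) {
        return false;  // most likely the field was never set
    }
    return (address & (alignment - 1)) == 0;
}

// Hypothetical helper: rounds `address` up to the next multiple of `alignment`
// (again assumed to be a non-zero power of two).
inline std::uint64_t alignUp(std::uint64_t address, std::uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (address + alignment - 1) & ~(alignment - 1);
}
```

In a real renderer the address would come from vkGetBufferDeviceAddress on the scratch buffer, and the check would run right before recording vkCmdBuildAccelerationStructuresKHR.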

The answer to my question is thus: don't look only at the command that triggers DEVICE_LOST; look at what you did with the queue before that command, because the error may well be there instead. In fact, once the first DEVICE_LOST error occurred, (almost?) all further vkQueueWaitIdle calls failed with the same error (as did vkQueueSubmit). In cases such as my buffer copy code being the first to fail, the error was always found in the queue usage before it.
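One way to make this easier to track down is to label every single-time submission and remember its wait result, so that the first failure and everything submitted before it are known. The following is only a sketch of that idea under my own assumptions: SubmissionLog and its methods are hypothetical names, and the small enum stands in for VkResult (with the same numeric values as VK_SUCCESS and VK_ERROR_DEVICE_LOST) so the snippet compiles without the Vulkan SDK.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the two VkResult values used here (assumption for this sketch).
// In real code, use VkResult from <vulkan/vulkan.h> instead.
enum SketchResult : int {
    SKETCH_SUCCESS = 0,      // same value as VK_SUCCESS
    SKETCH_DEVICE_LOST = -4  // same value as VK_ERROR_DEVICE_LOST
};

// Hypothetical helper: remembers a label and the wait result per submission.
class SubmissionLog {
    struct Entry {
        std::string label;
        SketchResult result;
    };
    std::vector<Entry> entries_;

public:
    // Call once after each submit + wait, passing the result of the wait.
    void record(std::string label, SketchResult waitResult) {
        assert(!label.empty());
        entries_.push_back(Entry{std::move(label), waitResult});
    }

    // Label of the first submission whose wait failed, or "" if none failed.
    std::string firstFailure() const {
        for (const Entry& e : entries_) {
            if (e.result != SKETCH_SUCCESS) return e.label;
        }
        return "";
    }

    // Labels of all submissions up to and including the first failure.
    // Later failures are left out because they only repeat the lost-device state.
    // Returns an empty list if nothing failed.
    std::vector<std::string> suspects() const {
        std::vector<std::string> labels;
        for (const Entry& e : entries_) {
            labels.push_back(e.label);
            if (e.result != SKETCH_SUCCESS) return labels;
        }
        return {};
    }
};
```

With such a log, a failing buffer copy would show up together with the acceleration structure build recorded just before it, which is where my mistakes actually were.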

I can't post the exact solution to my problem because, as I've said, there is more than one cause and I've only fixed some of them so far. I think the details are not relevant to future readers who come across my question, but if there's anything I can add to help other people, please let me know.
