CUDA-parallelized raytracer: very low speedup

Problem description

    I'm coding a raytracer using (py)CUDA and I'm getting a really low speedup; for example, on a 1000x1000 image, the GPU-parallelized code is just 4 times faster than the sequential code running on the CPU.

    For each ray I have to solve 5 equations (the raytracer generates images of black holes using the process described in this paper), so my setup is the following: each ray is computed in a separate block, where 5 threads compute the equations using shared memory. That is, if I want to generate an image with a width of W pixels and a height of H pixels, the setup is:

    • Grid: W blocks x H blocks.
    • Block: 5 threads.

    The most expensive computation is solving the equations, which I do with a custom Runge-Kutta 4-5 algorithm.

    The code is quite long and hard to explain in such a short question, but you can see it on GitHub. The CUDA kernel is here and the Runge-Kutta solver is here. The CPU version, a sequential implementation of the exact same solver, can be found in the same repo.

    The equations to solve involve several computations, and I'm afraid the CPU optimization of some functions like sin, cos and sqrt is causing the low speedup (?)

    My machine specs are:

    • GPU: GeForce GTX 780
    • CPU: Intel Core i7 CPU 930 @ 2.80GHz

    My questions are:

    1. Is it normal to get a speedup of 3x or 4x in a GPU-parallelized raytracer against a sequential code?
    2. Do you see anything wrong in the CUDA setup or in the code that could be causing this behaviour?
    3. Am I missing something important?

    I understand the question may be too specific, but if you need more information, just say so and I'll be glad to provide it.

    Solution

    1. Is it normal to get a speedup of 3x or 4x in a GPU-parallelized raytracer against a sequential code?

    How long is a piece of string? There is no answer to this question.

    2. Do you see anything wrong in the CUDA setup or in the code that could be causing this behaviour?

    Yes, as noted in comments, you are using a completely inappropriate block size which is wasting approximately 85% of the potential computational capacity of your GPU.
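As a rough check of the 85% figure, here is a minimal sketch of the arithmetic. It assumes the 32-thread warp size of NVIDIA GPUs such as the GTX 780, and that the hardware schedules whole warps, so a 5-thread block still occupies a full warp. The function name `warp_utilization` is illustrative and does not come from the asker's code.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs (assumed here for the GTX 780)


def warp_utilization(threads_per_block, warp_size=WARP_SIZE):
    """Return the fraction of allocated warp lanes that do useful work."""
    # A block is executed in whole warps, so round the warp count up.
    warps = -(-threads_per_block // warp_size)  # ceiling division
    return threads_per_block / (warps * warp_size)


used = warp_utilization(5)  # the block size described in the question
print(f"active lanes: {used:.1%}, idle lanes: {1 - used:.1%}")
```

With 5 threads per block, only 5 of the 32 lanes in the warp are active, which is about 15.6%, consistent with the "approximately 85%" wasted capacity mentioned above.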

    3. Am I missing something important?

    Yes, the answer to this question. Setting correct execution parameters is about 50% of the practical performance tuning requirements in CUDA, and you should be able to obtain noticeable performance improvements just by selecting a sane block size. Beyond this, careful profiling should be your next port of call.
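One possible way to choose a saner configuration, as a sketch only: give each ray its own thread (so a thread integrates all 5 equations of its ray) and use blocks whose size is a multiple of 32. The 16x16 block and the helper `launch_config` below are assumptions made for illustration; they are not taken from the original code, and the best block size should be confirmed by profiling.

```python
import math


def launch_config(width, height, block=(16, 16, 1)):
    """Return (block, grid) tuples for one thread per pixel/ray.

    The 16x16 default (256 threads, i.e. 8 full warps) is an assumed
    starting point, not a tuned value.
    """
    bx, by, _ = block
    # Round up so the grid covers the whole image even when the
    # dimensions are not multiples of the block side.
    grid = (math.ceil(width / bx), math.ceil(height / by), 1)
    return block, grid


block, grid = launch_config(1000, 1000)  # image size from the question
print("block:", block, "grid:", grid)
```

In PyCUDA these tuples would be passed as the `block=` and `grid=` keyword arguments of the kernel call. Because the grid overshoots the image when W or H is not a multiple of the block side, the kernel needs a bounds check so that out-of-range threads return early.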

    [This answer assembled from comments and added as community wiki entry to get this (very broad) question off the unanswered list in the absence of enough close votes to close it].
