What is the best way to programmatically choose the best GPU in OpenCL?

Question

On my laptop I have two graphics cards: an Intel Iris and an Nvidia GeForce GT 750M. I am trying to do a simple vector add using OpenCL. I know that the Nvidia card is much faster and can do the job better. In principle, I could put an if statement in the code that looks for NVIDIA in the VENDOR attribute, but I'd like something more elegant. What is the best way to choose the better (faster) GPU programmatically in OpenCL C/C++?

Recommended answer

I developed a real-time ray tracer (not just a ray caster) which programmatically chose two GPUs and a CPU, rendered on all three, and balanced the load between them in real time. Here is how I did it.

Let's say there are three devices, d1, d2, and d3. Assign each device a weight: w1, w2, and w3. Call the number of pixels to be rendered n. Assume a free parameter called alpha.

  1. Assign each device a weight of 1/3.
  2. Let alpha = 0.5.
  3. Render the first n1=w1*n pixels on d1, the next n2=w2*n pixels on d2, and the last n3=w3*n pixels on d3, and record the render times for each device: t1, t2, and t3.
  4. Calculate a value vsum = n1/t1 + n2/t2 + n3/t3.
  5. Recalculate the weights w_i = alpha*w_i + (1-alpha)*n_i/t_i/vsum.
  6. Go back to step 3.
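
Steps 4 and 5 can be sketched in C++ as a single function. This is a minimal sketch, not the ray tracer's actual code: the name `update_weights` and the use of `std::vector<double>` for the weights, pixel counts, and times are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One iteration of steps 4 and 5 (hypothetical helper, not from the answer).
// weights[i] is w_i, pixels[i] is n_i, seconds[i] is t_i (must be > 0).
// alpha is the fraction of the old weight that is kept.
std::vector<double> update_weights(const std::vector<double>& weights,
                                   const std::vector<double>& pixels,
                                   const std::vector<double>& seconds,
                                   double alpha) {
    assert(weights.size() == pixels.size());
    assert(weights.size() == seconds.size());

    // Step 4: vsum is the sum of the measured speeds n_i / t_i.
    double vsum = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        assert(seconds[i] > 0.0);
        vsum += pixels[i] / seconds[i];
    }
    assert(vsum > 0.0);

    // Step 5: w_i = alpha*w_i + (1-alpha)*(n_i/t_i)/vsum.
    std::vector<double> updated(weights.size());
    for (std::size_t i = 0; i < weights.size(); ++i) {
        const double speed_share = (pixels[i] / seconds[i]) / vsum;
        updated[i] = alpha * weights[i] + (1.0 - alpha) * speed_share;
    }
    return updated;
}
```

Because the speed shares sum to 1, the updated weights still sum to 1 whenever the old weights did. Called with weights of 1/3 each, 300 pixels per device, times of 10 s, 10 s, and 100 s, and alpha = 0.5, it returns approximately 0.405, 0.405, and 0.190.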

The point of alpha is to allow a smooth transition. Instead of reassigning all of the weight based on the latest times, the update mixes in some of the old weight. Without alpha I got instabilities. The value of alpha can be tuned. In practice it can probably be set to around 1%, but not 0%.

Let's work through an example.

I had a GTX 590, which is a dual-GPU card with two under-clocked GTX 580s. I also had a Sandy Bridge 2600K processor. The GPUs were much faster than the CPU. Let's assume each was about 10 times faster. Let's also say there were 900 pixels.

Render the first 300 pixels with GPU1, the next 300 pixels with GPU2, and the last 300 pixels with CPU1, and record times of 10 s, 10 s, and 100 s respectively. So one GPU would take 30 s for the whole image and the CPU alone would take 300 s. Both GPUs together would take 15 s.
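
Since the weights are fractions, the per-device pixel counts have to be rounded to whole pixels that still add up to n. The sketch below shows one way to do that. The function name `split_pixels`, rounding to the nearest pixel, and handing the remainder to the last device are all assumptions of this sketch; the answer does not say how rounding was handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split n pixels into contiguous chunks proportional to the weights
// (hypothetical helper). Device i gets counts[i] pixels, starting where
// device i-1 stopped. The last device absorbs any rounding remainder so
// that the counts always sum to exactly n.
std::vector<std::size_t> split_pixels(std::size_t n,
                                      const std::vector<double>& weights) {
    assert(!weights.empty());
    std::vector<std::size_t> counts(weights.size());
    std::size_t assigned = 0;
    for (std::size_t i = 0; i + 1 < weights.size(); ++i) {
        assert(weights[i] >= 0.0);
        // Round w_i * n to the nearest whole pixel.
        std::size_t c = static_cast<std::size_t>(
            weights[i] * static_cast<double>(n) + 0.5);
        // Never hand out more pixels than are left.
        if (c > n - assigned) c = n - assigned;
        counts[i] = c;
        assigned += c;
    }
    counts.back() = n - assigned;
    return counts;
}
```

With n = 900 and equal weights of 1/3 this gives 300 pixels per device, matching the first frame of the example.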

Calculate vsum = 30 + 30 + 3 = 63. Recalculate the weights: w1 = w2 = 0.5*(1/3) + 0.5*300/10/63 ≈ 0.4 and w3 = 0.5*(1/3) + 0.5*300/100/63 ≈ 0.2.

Render the next frame: 360 pixels with GPU1, 360 pixels with GPU2, and 180 pixels with CPU1. The times become a bit more balanced, say 12 s, 12 s, and 60 s.

After a number of frames the weights converge to the value of the (1-alpha) term, w_i = v_i/vsum. In this case the weights become roughly 47.6% (about 429 pixels), 47.6%, and 4.8% (about 43 pixels), and the times become about 14.3 s on every device. Since the two GPUs alone would take 15 s, adding the CPU improves the result by less than a second.
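
The convergence can be checked with a small simulation. This sketch assumes each device renders at a constant speed in pixels per second, so the "measured" time is simply n_i divided by that speed. The function name `simulate_balancing` and its parameters are hypothetical, and a real renderer would use measured times instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Run the balancing loop for a number of frames on simulated devices
// (hypothetical helper). speeds[i] is the assumed constant speed of
// device i in pixels per second, n is the pixel count per frame.
std::vector<double> simulate_balancing(const std::vector<double>& speeds,
                                       double n, double alpha, int frames) {
    const std::size_t k = speeds.size();
    assert(k > 0);
    // Step 1: start with equal weights.
    std::vector<double> w(k, 1.0 / static_cast<double>(k));
    std::vector<double> v(k);
    for (int f = 0; f < frames; ++f) {
        double vsum = 0.0;
        for (std::size_t i = 0; i < k; ++i) {
            assert(speeds[i] > 0.0);
            const double n_i = w[i] * n;          // pixels given to device i
            const double t_i = n_i / speeds[i];   // simulated render time
            v[i] = n_i / t_i;                     // measured speed
            vsum += v[i];
        }
        // Blend old weights with the measured speed shares.
        for (std::size_t i = 0; i < k; ++i) {
            w[i] = alpha * w[i] + (1.0 - alpha) * v[i] / vsum;
        }
    }
    return w;
}
```

For speeds of 30, 30, and 3 pixels per second, n = 900, and alpha = 0.5, the weights after a few dozen frames are close to 30/63, 30/63, and 3/63.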

I assumed a uniform load in this calculation. In a real ray tracer the load varies per scan line and per pixel, but the algorithm for determining the weights stays the same.

In practice, once the weights are found they don't change much unless the load of the scene changes significantly, e.g. if one region of the scene has high refraction and reflection and the rest is diffuse. Even in that case I limit the tree depth, so this does not have a dramatic effect.

It's easy to extend this method to more devices with a loop. I once tested my ray tracer on four devices: two 12-core Xeon CPUs and two GPUs. In this case the CPUs had a lot more influence, but the GPUs still dominated.

In case anyone is wondering, I created a context for each device and used each context in a separate thread (using pthreads). For three devices I used three threads.

In fact, you can use this to run on the same device with drivers from different vendors. For example, I used both the AMD and Intel CPU drivers simultaneously on my 2600K (each generating about half the frame) to see which vendor was better. When I first did this (2012), if I recall correctly, AMD beat Intel, ironically, on an Intel CPU.

In case anyone is interested in how I came up with the formula for the weights, I used an idea from physics (my background is physics, not programming).

Speed (v) = distance/time. In this case distance (d) is the number of pixels to process. The total distance then is

d = v1*t1 + v2*t2 + v3*t3

and we want all of them to finish in the same time t, so

d = (v1 + v2 + v3)*t

Then the definition of the weights,

v_i*t = w_i*d

gives

w_i = v_i*t/d

and substituting t/d = 1/(v1 + v2 + v3), which follows from d = (v1 + v2 + v3)*t, gives:

w_i = v_i /(v1 + v2 + v3)

It's easy to see this can be generalized to any number of devices k:

w_i = v_i/(v1 + v2 + ... + v_k)

So vsum in my algorithm stands for "sum of the velocities". Lastly, since v_i is pixels over time, it equals n_i/t_i, which finally gives

w_i = (n_i/t_i)/(n1/t1 + n2/t2 + ... + n_k/t_k)

which is the second term in my formula for calculating the weights.
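
The result of the derivation can be checked numerically: if each device gets the share w_i = v_i/(v1 + ... + v_k), every device should finish at the same time t = n/(v1 + ... + v_k). The sketch below assumes known constant speeds. The names `balanced_weights` and `finish_time` are hypothetical and do not come from the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ideal weights w_i = v_i / (v1 + ... + v_k) for known device speeds
// (hypothetical helper).
std::vector<double> balanced_weights(const std::vector<double>& speeds) {
    double vsum = 0.0;
    for (double v : speeds) {
        assert(v > 0.0);
        vsum += v;
    }
    assert(vsum > 0.0);
    std::vector<double> w(speeds.size());
    for (std::size_t i = 0; i < speeds.size(); ++i) {
        w[i] = speeds[i] / vsum;
    }
    return w;
}

// Time for a device with the given speed (pixels per second) to render
// its share w_i * n of the pixels.
double finish_time(double n, double weight, double speed) {
    assert(speed > 0.0);
    return weight * n / speed;
}
```

For speeds of 30, 30, and 3 pixels per second and n = 900, every device finishes in 900/63 seconds, about 14.3 s.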
