How does a GPU parallelize different tasks?


Question

I am really interested in understanding how a GPU parallelizes different tasks such as real-time rendering and training neural networks. I know the math behind parallelization, but I am curious how a GPU actually works. Real-time rendering and training neural networks are really different. How does a GPU parallelize these two tasks efficiently?

Answer

GPU parallelization requires the problem to be split up into as many independent, equal computations as possible (SIMD). What in C++ looks like

void example(float* data, const int N) {
    for(int n=0; n<N; n++) {
        data[n] += 1.0f;
    }
}

in OpenCL C looks like this:

kernel void example(global float* data) {
    const int n = get_global_id(0); // each work-item handles one array element
    data[n] += 1.0f;                // the explicit loop from the C++ version disappears
}

Some examples:

For real-time rendering, a tessellated surface can be rendered by the GPU by drawing every triangle on a separate GPU core. https://youtu.be/1ww8qRCMc4s
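The actual rasterization pipeline does this with vertex and fragment shaders rather than a hand-written kernel, but the "one triangle per core" idea can be sketched in OpenCL C, for example by computing a face normal for every triangle of a mesh independently. The kernel name and the flat vertex/index buffer layout below are assumptions made for illustration:

kernel void triangle_normals(global const float* vertices,  // x,y,z per vertex
                             global const int*   triangles, // 3 vertex indices per triangle
                             global float*       normals) { // x,y,z per triangle
    const int t = get_global_id(0);              // one work-item per triangle
    const int a = triangles[3*t  ];
    const int b = triangles[3*t+1];
    const int c = triangles[3*t+2];
    const float3 A = vload3(a, vertices);
    const float3 B = vload3(b, vertices);
    const float3 C = vload3(c, vertices);
    const float3 n = normalize(cross(B-A, C-A)); // independent of all other triangles
    vstore3(n, t, normals);
}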

Neural networks come down to large matrix multiplications, and within a matrix, individual columns or tiles can be computed independently and in parallel at the same time. Vector additions, for example, are parallelized into as many parts as there are vector components, and each GPU core computes only a single vector component.
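As a minimal sketch of that idea, here is a vector addition kernel where each work-item computes exactly one component, and a naive (untiled) matrix multiplication kernel where each work-item computes one element of the result; the kernel names and the row-major layout are assumptions made for illustration:

kernel void vector_add(global const float* a, global const float* b, global float* c) {
    const int n = get_global_id(0); // one work-item per vector component
    c[n] = a[n] + b[n];
}

kernel void matmul_naive(global const float* A, global const float* B,
                         global float* C, const int N) {
    const int col = get_global_id(0); // launched as an NxN 2D range:
    const int row = get_global_id(1); // one work-item per output element
    float sum = 0.0f;
    for(int k=0; k<N; k++) {
        sum += A[row*N + k] * B[k*N + col];
    }
    C[row*N + col] = sum;             // independent of every other output element
}

In practice, optimized implementations tile the matrices in local memory to reuse data within a work-group, but the per-element independence that makes the parallelization possible is the same.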

Lattice-based fluid simulations such as LBM work on a 3D lattice of, say, 256x256x256 lattice points. For each of these 16777216 lattice points the computations are the same, and they can be done concurrently because they are independent of each other. So the simulation is split up into 16777216 threads on the GPU, one for every lattice point. If the GPU has 4096 cores, it can compute 4096 of these concurrently. As you can imagine, this is orders of magnitude faster than running such tasks on CPUs. https://youtu.be/a1u2g9ahIDk
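Only the mapping of threads to lattice points is sketched below; the actual LBM collision and streaming math is omitted, and the kernel name and placeholder body are assumptions made for illustration:

kernel void lattice_step(global const float* src, global float* dst) {
    // launched as a 256x256x256 3D range: one work-item per lattice point
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int z = get_global_id(2);
    const int n = (z*256 + y)*256 + x; // linear index of this lattice point
    // the real collision/streaming step for point n would go here;
    // a plain copy keeps the sketch self-contained
    dst[n] = src[n];
}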

A particle simulation can compute each particle on a separate GPU core. This works as long as the particles are mostly independent of each other. https://youtu.be/8Szib8Km5Mo
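A minimal sketch of this, assuming simple Euler integration under gravity only and an x,y,z-per-particle buffer layout (both assumptions made for illustration):

kernel void integrate_particles(global float* pos, global float* vel, const float dt) {
    const int i = get_global_id(0);         // one work-item per particle
    float3 p = vload3(i, pos);
    float3 v = vload3(i, vel);
    v += (float3)(0.0f, -9.81f, 0.0f) * dt; // gravity only, so particles stay independent
    p += v * dt;
    vstore3(v, i, vel);
    vstore3(p, i, pos);
}

As soon as particles interact (collisions, attraction), neighboring particles have to exchange data and the kernel becomes more involved.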

For good saturation, to reach maximum efficiency, the number of threads should be much larger than the number of GPU cores available. Branching, for example, also takes a performance hit: within a group of 32 GPU cores, if one takes the true branch and all the others take the false branch, both branches have to be computed by every core in the group. In the tessellated surface rendering example, if the triangles have vastly different sizes, performance takes a hit for a similar reason: the entire group has to wait for the one GPU core with the largest triangle to finish. If all triangles are approximately the same size, however, performance is very good.
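A minimal sketch of such a divergent branch (kernel name made up for illustration):

kernel void divergent(global const float* in, global float* out) {
    const int n = get_global_id(0);
    // if even one work-item in a group of 32 takes the true branch while the
    // others take the false branch, the whole group steps through both
    // branches, with the inactive side masked out
    if(in[n] > 0.0f) {
        out[n] = sqrt(in[n]); // true branch
    } else {
        out[n] = 0.0f;        // false branch
    }
}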
