Does TensorFlow use all of the hardware on the GPU?

Question

The NVidia GP100 has 30 TPC circuits and 240 "texture units". Do the TPCs and texture units get used by TensorFlow, or are these disposable bits of silicon for machine learning?

I am looking at GPU-Z and Windows 10's built-in GPU performance monitor while a neural-net training session is running, and I see that various hardware functions are underutilized. TensorFlow uses CUDA. CUDA has access, I presume, to all hardware components. If I know where the gap is (between TensorFlow and the underlying CUDA) and whether it is material (how much silicon is wasted), I can, for example, remediate by making a clone of TensorFlow, modifying it, and then submitting a pull request.

For example, the answer below discusses texture objects, which are accessible from CUDA. NVidia notes that these can be used to speed up latency-sensitive, short-running kernels. If I google "TextureObject tensorflow" I don't get any hits. So I can sort of assume, barring evidence to the contrary, that TensorFlow is not taking advantage of TextureObjects.
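
For reference, this is roughly what reading data through a texture object looks like inside a CUDA kernel (a minimal hypothetical sketch, not code taken from TensorFlow; the kernel name and the 1x3 blur computation are purely illustrative):

    // Hypothetical kernel: a 1x3 box blur that fetches its input through a
    // cudaTextureObject_t instead of a raw pointer. tex2D<> goes through the
    // texture units / texture cache and applies whatever fixed-function
    // filtering was configured when the texture object was created.
    __global__ void blur_row(cudaTextureObject_t tex, float* out,
                             int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            // +0.5f addresses the texel centre with unnormalized coordinates.
            float left   = tex2D<float>(tex, x - 1 + 0.5f, y + 0.5f);
            float centre = tex2D<float>(tex, x     + 0.5f, y + 0.5f);
            float right  = tex2D<float>(tex, x + 1 + 0.5f, y + 0.5f);
            out[y * width + x] = (left + centre + right) / 3.0f;
        }
    }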

NVidia markets GPGPUs for neural-net training. So far it seems they have adopted a dual-use strategy for their circuits, so they are leaving in circuits that are not used for machine learning. This raises the question of whether a pure TensorFlow circuit would be more efficient. Google is now promoting TPUs for this reason. The jury is out on whether TPUs are actually cheaper for TensorFlow than NVidia GPUs. NVidia is challenging Google's price/performance claims.

Solution

None of those things are separate pieces of individual hardware that can be addressed separately in CUDA. Read this passage on page 10 of your document:

Each GPC inside GP100 has ten SMs. Each SM has 64 CUDA Cores and four texture units. With 60 SMs, GP100 has a total of 3840 single precision CUDA Cores and 240 texture units. Each memory controller is attached to 512 KB of L2 cache, and each HBM2 DRAM stack is controlled by a pair of memory controllers. The full GPU includes a total of 4096 KB of L2 cache.

And if we read just above that:

GP100 was built to be the highest performing parallel computing processor in the world to address the needs of the GPU accelerated computing markets serviced by our Tesla P100 accelerator platform. Like previous Tesla-class GPUs, GP100 is composed of an array of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and memory controllers. A full GP100 consists of six GPCs, 60 Pascal SMs, 30 TPCs (each including two SMs), and eight 512-bit memory controllers (4096 bits total).

and take a look at the diagram, we see the following:

So not only are the GPCs and SMs not separate pieces of hardware, but even the TPCs are just another way to reorganize the hardware architecture and come up with a fancy marketing name. You can clearly see that a TPC doesn't add anything new in the diagram; it just looks like a container for the SMs. It's [1 GPC] : [5 TPCs] : [10 SMs].
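
For reference, the counts quoted from the whitepaper all follow from that hierarchy: 6 GPCs × 5 TPCs × 2 SMs = 60 SMs, 60 SMs × 64 CUDA cores = 3840 single-precision cores, and 60 SMs × 4 texture units = 240 texture units.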

The memory controllers are something all hardware is going to have in order to interface with RAM; it just happens that more memory controllers enable higher bandwidth (with GP100's eight 512-bit controllers, the path to memory is 4096 bits wide). See this diagram:

where "High bandwidth memory" refers to HBM2 a type of video memory like GDDR5, in other words, video RAM. This isn't something you would directly address in software with CUDA any more than you would do so with X86 desktop machines.

So in reality, we only have SMs here, not TPCs and GPCs. So to answer your question, since TensorFlow takes advantage of CUDA, presumably it's going to use all the available hardware it can.

EDIT: The poster edited their question into an entirely different question, and has new misconceptions there, so here is the answer to that:

Texture Processing Clusters (TPCs) and texture units are not the same thing. TPCs appear to be merely an organization of Streaming Multiprocessors (SMs) with a bit of marketing magic thrown in.

"Texture unit" is not a concrete term, and the features differ from GPU to GPU, but basically you can think of one as ready access to texture memory, which exploits spatial coherence (versus the L1/L2/L3... caches, which exploit temporal coherence), combined with some fixed-function capability. The fixed functions may include interpolating access filters (often at least linear interpolation), different coordinate/addressing modes, mipmapping control and anisotropic texture filtering. See the CUDA 9.0 Programming Guide on this topic to get an idea of texture-unit functionality and what you can control with CUDA. On the diagram we can see the texture units at the bottom.
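
To make the "what you can control" part concrete, here is a minimal host-side sketch (an illustrative helper of my own, not TensorFlow code; the function name and the pitched 2D float-image layout are assumptions) showing a texture object being created with some of that fixed-function behaviour selected:

    #include <cuda_runtime.h>

    // Hypothetical helper: wrap an already-allocated pitched 2D float image
    // (d_img, width x height, row pitch in bytes) in a texture object.
    cudaTextureObject_t make_tex2d(float* d_img, size_t width, size_t height,
                                   size_t pitch)
    {
        cudaResourceDesc resDesc = {};
        resDesc.resType                  = cudaResourceTypePitch2D;
        resDesc.res.pitch2D.devPtr       = d_img;
        resDesc.res.pitch2D.desc         = cudaCreateChannelDesc<float>();
        resDesc.res.pitch2D.width        = width;
        resDesc.res.pitch2D.height       = height;
        resDesc.res.pitch2D.pitchInBytes = pitch;

        cudaTextureDesc texDesc = {};
        texDesc.addressMode[0]   = cudaAddressModeClamp;  // coordinate/addressing mode
        texDesc.addressMode[1]   = cudaAddressModeClamp;
        texDesc.filterMode       = cudaFilterModeLinear;  // fixed-function linear interpolation
        texDesc.readMode         = cudaReadModeElementType;
        texDesc.normalizedCoords = 0;

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
        return tex;  // destroy later with cudaDestroyTextureObject(tex)
    }

Kernels then fetch through it with tex2D<float>(tex, x, y), and the texture units perform the addressing and filtering chosen above.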

Clearly these are completely different from the TPCs shown in the first picture I posted, which at least according to the diagram have no extra functionality associated with them and are merely a container for two SMs.

Now, despite the fact that you can address texture functionality within CUDA, you often don't need to. The texture units' fixed-function capability is not all that useful to neural nets; however, the spatially coherent texture memory is often used automatically by CUDA as an optimization even if you don't explicitly try to access it. In this way, TensorFlow still would not be "wasting" silicon.
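
As a rough illustration of that "automatic" use (a hypothetical kernel of my own, not TensorFlow source): on Kepler and later GPUs, marking read-only inputs const __restrict__, or using __ldg(), lets the compiler route ordinary loads through the read-only data cache, which is backed by the texture-cache hardware, with no texture API in sight:

    // Hypothetical elementwise kernel in the style of what a framework might
    // generate. The const __restrict__ qualifier (or an explicit __ldg, which
    // requires compute capability 3.5+) lets nvcc issue the load of x[i]
    // through the read-only/texture cache path.
    __global__ void axpy(int n, float a,
                         const float* __restrict__ x,
                         float*       __restrict__ y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = a * __ldg(&x[i]) + y[i];
        }
    }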
