Why should preprocessing be done on CPU rather than GPU?


Problem description


The performance guide advises doing the preprocessing on the CPU rather than on the GPU. The listed reasons are:

  1. This prevents the data from going from CPU to GPU, back to CPU, and to GPU again.
  2. This frees the GPU from these tasks so it can focus on training.

I am not sure I understand either argument.

  1. Why would preprocessing send the result back to the CPU, especially if all nodes are on the GPU? Why preprocessing operations and not any other operation in the graph? Why are they, or should they be, special?
  2. Even though I understand the rationale behind putting the CPU to work rather than keeping it idle, I would have assumed that random cropping, flipping, and other standard preprocessing steps on the input images are nowhere near the computational needs of the huge convolutions and gradient backpropagation a training step has to do, and should execute in a fraction of the time. Even if we think of preprocessing as mostly moving things around (crops, flips), GPU memory should be faster for that. Yet doing preprocessing on the CPU can yield a 6+-fold increase in throughput according to the same guide.

I am assuming, of course, that preprocessing does not drastically decrease the size of the data (e.g. subsampling or cropping to a much smaller size), in which case the gain in transfer time to the device is obvious. I suppose these are rather extreme cases and do not constitute the basis for the above recommendation.

Can somebody make sense out of this?

Solution

It is based on the same logic of how CPUs and GPUs work. A GPU is very good at repetitive, parallelisable tasks, whereas a CPU is better at other computations that require more general processing capability.

For example, consider a program that accepts two integers as input from the user and runs a for-loop one million times to sum the two numbers.

How can we achieve this with a combination of CPU and GPU processing?

We do the initial data capture (the two user-input integers) on the CPU and then send the two numbers to the GPU, where the for-loop summing the numbers runs, because that is the repetitive, parallelisable yet simple computation the GPU is better at. [Although this example isn't exactly related to TensorFlow, the concept is at the heart of all CPU and GPU processing. Regarding your query: processing steps like random cropping, flipping, and other standard preprocessing of input images might not be computationally intensive, but the GPU doesn't excel at this kind of interrupt-related computation either.]
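To make the division of labour concrete, here is a minimal sketch in TensorFlow 1.x (the era of the linked performance guide). The placeholder names and the vectorised form of the loop are my own illustration, not code from the guide:

```python
import tensorflow as tf

# Hypothetical sketch: the CPU handles the input-gathering part, the GPU
# runs the repetitive, parallelisable arithmetic.
with tf.device('/cpu:0'):
    a = tf.placeholder(tf.float32, shape=[], name='first_number')
    b = tf.placeholder(tf.float32, shape=[], name='second_number')

with tf.device('/gpu:0'):
    # The "sum the two numbers a million times" loop collapses into one
    # vectorised op, which is exactly the kind of work a GPU is built for.
    repeated = tf.fill([1000000], a + b)
    total = tf.reduce_sum(repeated)

# allow_soft_placement lets the sketch also run on a machine without a GPU.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(total, feed_dict={a: 2.0, b: 3.0}))
```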

Another thing to keep in mind is that the latency between CPU and GPU also plays a key role here. Copying and transferring data to and from the CPU and GPU is expensive compared to transferring data between different cache levels inside the CPU.

As Dey (2014) [1] mentions:

When a parallelized program is computed on the GPGPU, first the data is copied from the memory to the GPU and after computation the data is written back to the memory from the GPU using the PCI-e bus (Refer to Fig. 1.8). Thus for every computation, data has to be copied to and fro device-host-memory. Although the computation is very fast in GPGPU, but because of the gap between the device-host-memory due to communication via PCI-e, the bottleneck in the performance is generated.

For this reason it is advisable that:

You do the preprocessing on CPU, where the CPU does the initial computation, prepares and sends the rest of the repetitive parallelised tasks to the GPU for further processing.

I once developed a buffer mechanism to increase the data-processing throughput between CPU and GPU and thereby reduce the negative effects of latency between them. Have a look at this thesis to gain a better understanding of the issue:

Efficient Data Input/Output (I/O) for Finite Difference Time Domain (FDTD) Computation on Graphics Processing Unit (GPU)

Now, to answer your question:

Why would preprocessing send the result back to the CPU, esp. if all nodes are on GPU?

As quoted from the performance guide of Tensorflow [2],

When preprocessing occurs on the GPU the flow of data is CPU -> GPU (preprocessing) -> CPU -> GPU (training). The data is bounced back and forth between the CPU and GPU.

If you remember the dataflow diagram between CPU, memory, and GPU mentioned above, the reason doing the preprocessing on the CPU improves performance is:

  • After computation of nodes on the GPU, data is sent back to memory, and the CPU fetches that memory for further processing. The GPU does not have enough on-board memory (on the GPU itself) to keep all the data on it for computational purposes, so back-and-forth movement of data is inevitable. To optimise this data flow, you do the preprocessing on the CPU; then the data prepared for the parallelisable tasks (for training purposes) is sent to memory, and the GPU fetches that preprocessed data and works on it, as sketched below.
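The practical way to express this in TensorFlow is to pin the input pipeline to the CPU and prefetch, so a preprocessed batch is already waiting when the GPU finishes a training step. A minimal sketch in the TF 1.x tf.data API, assuming hypothetical JPEG file paths and the crop-and-flip augmentations from the question:

```python
import tensorflow as tf

filenames = ['img_0.jpg', 'img_1.jpg']  # hypothetical input paths

def parse_and_augment(path):
    # Decode and apply the standard preprocessing steps from the question:
    # random crop and random flip (assumes images are at least 224x224).
    image = tf.image.decode_jpeg(tf.read_file(path), channels=3)
    image = tf.random_crop(image, [224, 224, 3])
    image = tf.image.random_flip_left_right(image)
    return tf.cast(image, tf.float32)

with tf.device('/cpu:0'):  # keep all preprocessing on the CPU
    dataset = (tf.data.Dataset.from_tensor_slices(filenames)
               .map(parse_and_augment, num_parallel_calls=4)
               .batch(2)
               .prefetch(1))  # prepare the next batch while the GPU
                              # trains on the current one
    images = dataset.make_one_shot_iterator().get_next()

# `images` can now feed a training graph placed on /gpu:0.
```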

The performance guide itself also mentions that by doing this, and by having an efficient input pipeline, you won't starve either the CPU or the GPU (or both), which itself supports the aforementioned logic. Again, in the same performance doc, you will also see it mentioned that:

If your training loop runs faster when using SSDs vs HDDs for storing your input data, you could be I/O bottlenecked. If this is the case, you should pre-process your input data, creating a few large TFRecord files.

This again points to the same CPU-memory-GPU performance bottleneck mentioned above.
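For completeness, here is a minimal sketch of that offline preprocessing step; the file names and the single-feature record layout are my own assumptions, not something prescribed by the guide:

```python
import tensorflow as tf

image_paths = ['img_0.jpg', 'img_1.jpg']  # hypothetical input files

# Pack the raw images into one large TFRecord file so training reads a
# few big sequential files instead of many small ones.
with tf.python_io.TFRecordWriter('train.tfrecord') as writer:
    for path in image_paths:
        with open(path, 'rb') as f:
            encoded = f.read()
        example = tf.train.Example(features=tf.train.Features(feature={
            'image_raw': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[encoded])),
        }))
        writer.write(example.SerializeToString())
```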

Hope this helps, and in case you need more clarification (on CPU-GPU performance), do not hesitate to drop a message!

References:

[1] Somdip Dey, Efficient Data Input/Output (I/O) for Finite Difference Time Domain (FDTD) Computation on Graphics Processing Unit (GPU), 2014

[2] Tensorflow Performance Guide: https://www.tensorflow.org/performance/performance_guide
