Tensorflow: GPU Acceleration only happens after first run

Question

I've installed CUDA and CUDNN on my machine (Ubuntu 16.04) alongside tensorflow-gpu.

Versions used: CUDA 10.0, CUDNN 7.6, Python 3.6, Tensorflow 1.14

This is the output from nvidia-smi, showing the video card configuration.

| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960M    On   | 00000000:02:00.0 Off |                  N/A |
| N/A   44C    P8    N/A /  N/A |    675MiB /  4046MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1502      G   /usr/lib/xorg/Xorg                           363MiB |
|    0      3281      G   compiz                                        96MiB |
|    0      4375      G   ...uest-channel-token=14359313252217012722    69MiB |
|    0      5157      C   ...felipe/proj/venv/bin/python3.6            141MiB |
+-----------------------------------------------------------------------------+

This is the output from device_lib.list_local_devices() (a tensorflow helper method that shows which devices it can see), showing that my GPU is visible to tensorflow:

[name: "/device:CPU:0"
  device_type: "CPU"
  memory_limit: 268435456
  locality {
  }
  incarnation: 5096693727819965430, 
name: "/device:XLA_GPU:0"
  device_type: "XLA_GPU"
  memory_limit: 17179869184
  locality {
  }
  incarnation: 13415556283266501672
  physical_device_desc: "device: XLA_GPU device", 
name: "/device:XLA_CPU:0"
  device_type: "XLA_CPU"
  memory_limit: 17179869184
  locality {
  }
  incarnation: 14339781620792127180
  physical_device_desc: "device: XLA_CPU device", 
name: "/device:GPU:0"
  device_type: "GPU"
  memory_limit: 3464953856
  locality {
    bus_id: 1
    links {
    }
  }
  incarnation: 13743207545082600644
  physical_device_desc: "device: 0, name: GeForce GTX 960M, pci bus id: 0000:02:00.0, compute capability: 5.0"
]
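
For reference, a listing like the one above can be reproduced with a couple of lines (a minimal sketch, assuming a TF 1.x environment like the one described here):

# Minimal sketch: print the devices TensorFlow can see (TF 1.x)
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    print(dev.name, dev.device_type, dev.memory_limit)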

Now, as for actually using the GPU for computations: I've used a small piece of code to run some dummy matrix multiplications on the CPU and on the GPU, to compare their performance:

import tensorflow as tf
from datetime import datetime

shapes = [(50, 50), (100, 100), (500, 500), (1000, 1000), (10000, 10000), (15000, 15000)]

devices = ['/device:CPU:0', '/device:XLA_GPU:0']

for device in devices:
    for shape in shapes:
        # Build the graph, pinning the ops to the requested device
        with tf.device(device):
            random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
            dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
            sum_operation = tf.reduce_sum(dot_operation)

        # Time the actual runtime of the operations
        start_time = datetime.now()
        with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
            result = session.run(sum_operation)
        elapsed_time = datetime.now() - start_time

        # Print elapsed time, shape and device used
        print("Input shape:", shape, "using Device:", device,
              "took: {:.2f}".format(elapsed_time.total_seconds()))

Here is the surprise. The first time I run the cell containing this block of code (I'm on a jupyter notebook), the GPU computations take much longer than the CPU:

# output of first run: CPU is faster
----------------------------------------
Input shape: (50, 50) using Device: /device:CPU:0 took: 0.01
Input shape: (100, 100) using Device: /device:CPU:0 took: 0.01
Input shape: (500, 500) using Device: /device:CPU:0 took: 0.01
Input shape: (1000, 1000) using Device: /device:CPU:0 took: 0.02
Input shape: (10000, 10000) using Device: /device:CPU:0 took: 6.22
Input shape: (15000, 15000) using Device: /device:CPU:0 took: 21.23
----------------------------------------
Input shape: (50, 50) using Device: /device:XLA_GPU:0 took: 2.82
Input shape: (100, 100) using Device: /device:XLA_GPU:0 took: 0.17
Input shape: (500, 500) using Device: /device:XLA_GPU:0 took: 0.18
Input shape: (1000, 1000) using Device: /device:XLA_GPU:0 took: 0.20
Input shape: (10000, 10000) using Device: /device:XLA_GPU:0 took: 28.36
Input shape: (15000, 15000) using Device: /device:XLA_GPU:0 took: 93.73
----------------------------------------

Surprise #2: When I rerun the cell containing the dummy matrix multiplication code, the GPU version is much faster (as expected):

# output of reruns: GPU is faster
----------------------------------------
Input shape: (50, 50) using Device: /device:CPU:0 took: 0.02
Input shape: (100, 100) using Device: /device:CPU:0 took: 0.02
Input shape: (500, 500) using Device: /device:CPU:0 took: 0.02
Input shape: (1000, 1000) using Device: /device:CPU:0 took: 0.04
Input shape: (10000, 10000) using Device: /device:CPU:0 took: 6.78
Input shape: (15000, 15000) using Device: /device:CPU:0 took: 24.65
----------------------------------------
Input shape: (50, 50) using Device: /device:XLA_GPU:0 took: 0.14
Input shape: (100, 100) using Device: /device:XLA_GPU:0 took: 0.12
Input shape: (500, 500) using Device: /device:XLA_GPU:0 took: 0.13
Input shape: (1000, 1000) using Device: /device:XLA_GPU:0 took: 0.14
Input shape: (10000, 10000) using Device: /device:XLA_GPU:0 took: 1.64
Input shape: (15000, 15000) using Device: /device:XLA_GPU:0 took: 5.29
----------------------------------------

So my question is: why does GPU acceleration only really happen after the code has been run once?

I can see the GPU is correctly set up (otherwise no acceleration would happen at all). Is it due to some sort of initial overhead? Do GPUs need to warm up before we can actually use them?
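
(As an aside: a common way to separate one-time setup costs from steady-state performance is to do an untimed warm-up run first. A minimal sketch, reusing the sum_operation and imports from the benchmark above:)

# Sketch: discard an untimed warm-up run, then time a second run in the same session
with tf.Session() as session:
    session.run(sum_operation)            # warm-up: absorbs one-time setup costs
    start_time = datetime.now()
    session.run(sum_operation)            # timed: closer to steady-state performance
    print("steady-state took:", (datetime.now() - start_time).total_seconds())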

P.S.: On both runs (i.e. the one where the GPU was slower and the next ones, where the GPU was faster), I could see GPU Usage was 100%, so it was definitely being used.

P.S.: Only in the very first run does it seem the GPU doesn't get picked up. If I then run it two, three or more times, all runs after the first one are successful (i.e. GPU computation is faster).

Answer

robert-crovella's comment made me look into the XLA thing, which helped me find the solution.

Turns out the GPU is mapped to a Tensorflow device in two ways: as XLA device and as a normal GPU.

This is why there were two devices, one named "/device:XLA_GPU:0" and the other "/device:GPU:0".

All I needed to do was to activate "/device:GPU:0" instead. Now the GPU gets picked up by Tensorflow immediately.
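
Concretely, the only change needed in the benchmark above is the device string; a minimal sketch of that change (everything else stays the same):

# Sketch of the fix: target the plain GPU device rather than the XLA one
devices = ['/device:CPU:0', '/device:GPU:0']   # previously '/device:XLA_GPU:0'

with tf.device('/device:GPU:0'):
    random_matrix = tf.random_uniform(shape=(1000, 1000), minval=0, maxval=1)
    sum_operation = tf.reduce_sum(tf.matmul(random_matrix, tf.transpose(random_matrix)))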
