Keras shows no improvement in training speed with GPU (partial GPU usage?!)


Problem description

I am trying to train my model on a GPU instead of a CPU, on an AWS p2.xlarge instance, from my Jupyter Notebook. I am using the tensorflow-gpu backend (only tensorflow-gpu is installed and listed in requirements.txt, not tensorflow).

I am not seeing any speed improvement when training models on these instances compared to using a CPU; in fact, the training speed per epoch is almost the same as what I get on my 4-core laptop CPU (the p2.xlarge also has 4 vCPUs alongside a Tesla K80 GPU). I am not sure whether I need to change my code to take advantage of the faster/parallel processing the GPU can offer. I am pasting my model code below:

from keras.models import Sequential
from keras.layers import recurrent, core

# Two stacked LSTM layers, dropout, and a 3-class softmax classifier
model = Sequential()
model.add(recurrent.LSTM(64, input_shape=(X_np.shape[1], X_np.shape[2]),
                         return_sequences=True))
model.add(recurrent.LSTM(64, return_sequences=False))
model.add(core.Dropout(0.1))
model.add(core.Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(X_np, y_np, epochs=100, validation_split=0.25)

Also, interestingly, the GPU seems to be utilizing 50%-60% of its processing power and almost all of its memory every time I check the GPU status with nvidia-smi (but both fall to 0% and 1MiB respectively when not training):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    73W / 149W |  10919MiB / 11439MiB |     52%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1665      C   ...ubuntu/aDash/MLenv/bin/python           10906MiB |
+-----------------------------------------------------------------------------+

Also, in case you'd like to see my logs about using the GPU from the Jupyter Notebook:

[I 04:21:59.390 NotebookApp] Kernel started: c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
[I 04:22:02.241 NotebookApp] Adapting to protocol v5.1 for kernel c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
2017-11-30 04:22:32.403981: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-30 04:22:33.653681: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-30 04:22:33.654041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2017-11-30 04:22:33.654070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2017-11-30 04:22:34.014329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2017-11-30 04:22:34.015339: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7

2017-11-30 04:23:22.426895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
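
For reference, the devices TensorFlow can see are also listable directly from the notebook; a minimal check, assuming the TensorFlow 1.x API, is:

from tensorflow.python.client import device_lib

# Lists the CPU and GPU devices TensorFlow detected; the Tesla K80 should
# appear as /device:GPU:0 when the tensorflow-gpu build is in use.
print(device_lib.list_local_devices())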

Please suggest what the problem could be. Thanks a ton for looking at this anyway!

Recommended answer

This happens because you're using LSTM layers.

TensorFlow's implementation of LSTM layers is not that great. The reason is probably that recurrent calculations are not parallel calculations, while GPUs are great at parallel processing.

I confirmed this from my own experience:

  • Got terrible speed using LSTMs in my model
  • Decided to test the model with all LSTMs removed (ending up with a purely convolutional model)
  • The resulting speed was simply astonishing!

This article about using GPUs with TensorFlow also confirms this.

You may try using the new CuDNNLSTM layer, which seems to be built specifically for running on GPUs.

I have never tested it, but you'll most probably get better performance with it.
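
As a rough, untested sketch, the model from the question could be rebuilt with CuDNNLSTM like this (assuming Keras >= 2.0.9 with the tensorflow-gpu backend, a CUDA-capable GPU with cuDNN available, and the same X_np/y_np arrays; note that CuDNNLSTM exposes no activation or unroll options, since it wraps cuDNN's fixed implementation):

from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dropout, Dense

model = Sequential()
# CuDNNLSTM runs the whole recurrent step inside cuDNN's fused GPU kernel
model.add(CuDNNLSTM(64, input_shape=(X_np.shape[1], X_np.shape[2]),
                    return_sequences=True))
model.add(CuDNNLSTM(64, return_sequences=False))
model.add(Dropout(0.1))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(X_np, y_np, epochs=100, validation_split=0.25)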

Another thing that I haven't tested, and I'm not sure it's designed for that reason, but I suspect it is: you can pass unroll=True to your LSTM layers. With that, I suspect the recurrent calculations will be transformed into parallel ones.
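
A minimal sketch of that idea, again untested, applied to the layers from the question (unrolling is only practical for short, fixed-length sequences, since the unrolled graph grows with sequence length and uses more memory):

# Same stacked LSTMs as before, but with the recurrence unrolled into a
# static graph instead of a symbolic loop (assumes short sequences).
model.add(recurrent.LSTM(64, input_shape=(X_np.shape[1], X_np.shape[2]),
                         return_sequences=True, unroll=True))
model.add(recurrent.LSTM(64, return_sequences=False, unroll=True))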
