CUDA_ERROR_LAUNCH_FAILED with Tensorflow and Keras
Question
I'm using Keras to train a convolutional neural network with the fit_generator function, because the images are stored in .h5 files and don't fit in memory. Most of the time I can't train the model: it either gets stuck in the middle of the first epoch or crashes with 'GPU sync failed' or 'CUDA_ERROR_LAUNCH_FAILED' (see the logs below). Training on the CPU works well, but it is of course slower. I'm using two different machines and both have the same issue. My guess is that it is an installation/configuration problem, but I don't know how to fix it.
On both machines Tensorflow was installed as explained here: https://www.anaconda.com/blog/developer-blog/tensorflow-in-anaconda/
I used this script https://github.com/tensorflow/tensorflow/blob/master/tools/tf_env_collect.sh to collect the following information.
Here is the tf_env.txt:
First machine:
Keras 2.2.4.
== cat /etc/issue ===============================================
Linux liph02.novalocal 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
== are we in docker =============================================
No
== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux liph02.novalocal 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy 1.15.4
numpydoc 0.8.0
protobuf 3.6.1
tensorflow 1.12.0
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.12.0
tf.GIT_VERSION = b'unknown'
tf.COMPILER_VERSION = b'unknown'
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH /usr/local/cuda-9.2/lib64
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Fri Dec 28 16:13:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78 Driver Version: 410.78 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:00:06.0 Off | N/A |
| 22% 38C P0 57W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
/usr/local/Wolfram/Mathematica/11.3/SystemFiles/Components/MXNetLink/LibraryResources/Linux-x86-64/libcudart.so.9.1
Second machine:
Keras 2.2.4.
== cat /etc/issue ===============================================
Linux liph01.novalocal 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
== are we in docker =============================================
No
== compiler =====================================================
c++ (GCC) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux liph01.novalocal 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
msgpack-numpy 0.4.3.2
numpy 1.15.3
numpydoc 0.8.0
protobuf 3.6.0
tensorflow 1.11.0
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.11.0
tf.GIT_VERSION = b'unknown'
tf.COMPILER_VERSION = b'unknown'
== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Thu Jan 3 17:38:44 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:00:07.0 Off | N/A |
| 40% 65C P2 94W / 250W | 11747MiB / 12196MiB | 90% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 16991 C python 11737MiB |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2.148
/usr/local/cuda-9.2/doc/man/man7/libcudart.7
/usr/local/cuda-9.2/doc/man/man7/libcudart.so.7
Here are the two stack traces:
(dev) -bash-4.2$ python classifier_training.py --dirs /data/simulations/Paranal_gam/ /data/simulations/Paranal_prot/ --epochs 1 --batch_size 32 --workers 16 --model ClassifierV2 --patience 1
Using TensorFlow backend.
ClassifierV2
Building training generator...
Building validation generator...
2018-12-18 12:15:19.553286: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2018-12-18 12:15:20.043811: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-18 12:15:20.047991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:00:06.0
totalMemory: 11.91GiB freeMemory: 11.75GiB
2018-12-18 12:15:20.048093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
Traceback (most recent call last):
File "classifier_training.py", line 122, in <module>
model = class_v2.get_model()
File "/data/ctasoft/cta-lstchain/cnn/classifiers.py", line 40, in get_model
self.model.add(Conv2D(16, kernel_size=(3, 3), input_shape=(1, self.img_rows, self.img_cols), data_format='channels_first', activation='relu'))
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/engine/sequential.py", line 165, in add
layer(x)
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/engine/base_layer.py", line 457, in __call__
output = self.call(inputs, **kwargs)
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/layers/convolutional.py", line 171, in call
dilation_rate=self.dilation_rate)
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 3641, in conv2d
x, tf_data_format = _preprocess_conv2d_input(x, data_format)
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 3521, in _preprocess_conv2d_input
if not _has_nchw_support() or force_transpose:
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 292, in _has_nchw_support
gpus_available = len(_get_available_gpus()) > 0
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 278, in _get_available_gpus
_LOCAL_DEVICES = get_session().list_devices()
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 186, in get_session
_SESSION = tf.Session(config=config)
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1551, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 676, in __init__
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: unspecified launch failure
(dev) -bash-4.2$ python classifier_training.py --dirs /data/simulations/Paranal_gam /data/simulations/Paranal_prot --workers 1 --epochs 10 --batch_size 16 --model ClassifierV2 --patience 9
Using TensorFlow backend.
ClassifierV2
Building training generator...
Building validation generator...
2018-12-29 19:29:11.142008: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2018-12-29 19:29:11.892617: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-29 19:29:11.896828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:00:06.0
totalMemory: 11.91GiB freeMemory: 11.75GiB
2018-12-29 19:29:11.896880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-29 19:29:12.960736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-29 19:29:12.960804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-12-29 19:29:12.960819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-12-29 19:29:12.961681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11366 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:00:06.0, compute capability: 6.1)
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 16, 98, 98) 160
_________________________________________________________________
conv2d_2 (Conv2D) (None, 16, 96, 96) 2320
_________________________________________________________________
average_pooling2d_1 (Average (None, 16, 48, 48) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 16, 48, 48) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 32, 46, 46) 4640
_________________________________________________________________
conv2d_4 (Conv2D) (None, 32, 44, 44) 9248
_________________________________________________________________
average_pooling2d_2 (Average (None, 32, 22, 22) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 32, 22, 22) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 15488) 0
_________________________________________________________________
dense_1 (Dense) (None, 128) 1982592
_________________________________________________________________
dropout_3 (Dropout) (None, 128) 0
_________________________________________________________________
dense_2 (Dense) (None, 256) 33024
_________________________________________________________________
dropout_4 (Dropout) (None, 256) 0
_________________________________________________________________
dense_3 (Dense) (None, 1) 257
=================================================================
Total params: 2,032,241
Trainable params: 2,032,241
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
4/8065 [..............................] - ETA: 1:52:06 - loss: 0.9940 - acc: 0.4531 - precision: 0.4947 - recall: 0.7188
2018-12-29 19:29:54.459471: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2018-12-29 19:29:54.459645: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
Aborted
Recommended answer
On the first machine this looks like a CUDA version mismatch: LD_LIBRARY_PATH points at CUDA 9.2, while the only libcudart found is version 9.1. Make sure a single version of CUDA is used. On the second machine the CUDA and cuDNN environment variables are not set properly (LD_LIBRARY_PATH is unset). Follow the instructions in the Tensorflow GPU support guide.
Also check the NVIDIA driver version and the GPU's compute capability, and install a matching CUDA version.
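To make the two observations above concrete, here is a minimal sketch that applies the same check to the values in the question's tf_env.txt. The function names cuda_versions and diagnose are hypothetical helpers, not part of Tensorflow or CUDA, and the check is only a heuristic on path names: a library found on disk is not necessarily the one Tensorflow actually loads.

```python
import re

# Patterns for CUDA version numbers (major.minor) in paths such as
# ".../cuda-9.2/lib64" or ".../libcudart.so.9.1".
_CUDA_DIR = re.compile(r"cuda-(\d+\.\d+)")
_CUDART_SO = re.compile(r"libcudart\.so\.(\d+\.\d+)")


def cuda_versions(paths):
    """Return the set of CUDA major.minor versions mentioned in the paths."""
    versions = set()
    for path in paths:
        for pattern in (_CUDA_DIR, _CUDART_SO):
            match = pattern.search(path)
            if match:
                versions.add(match.group(1))
    return versions


def diagnose(ld_library_path, cudart_paths):
    """Return a list of warnings for one machine (hypothetical helper)."""
    warnings = []
    entries = [p for p in (ld_library_path or "").split(":") if p]
    if not entries:
        warnings.append("LD_LIBRARY_PATH is unset or empty")
    found = cuda_versions(entries) | cuda_versions(cudart_paths)
    if len(found) > 1:
        warnings.append(
            "multiple CUDA versions found: " + ", ".join(sorted(found))
        )
    return warnings


# Values copied from the tf_env.txt output in the question.
first = diagnose(
    "/usr/local/cuda-9.2/lib64",
    ["/usr/local/Wolfram/Mathematica/11.3/SystemFiles/Components/MXNetLink/"
     "LibraryResources/Linux-x86-64/libcudart.so.9.1"],
)
second = diagnose(
    None,
    ["/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2.148"],
)
print(first)   # warnings for the first machine
print(second)  # warnings for the second machine
```

On a live machine the inputs could come from os.environ.get("LD_LIBRARY_PATH") and from the "cuda libs" section of tf_env_collect.sh instead of the hard-coded strings.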