带有Tensorflow和Keras的CUDA_ERROR_LAUNCH_FAILED [英] CUDA_ERROR_LAUNCH_FAILED with Tensorflow and Keras

查看:109
本文介绍了带有Tensorflow和Keras的CUDA_ERROR_LAUNCH_FAILED的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用Keras使用fit_generator函数训练卷积神经网络,因为图像存储在.h5文件中,并且不适合内存.大多数情况下,由于模型卡在第一个时期的中间,我无法训练模型,否则会崩溃,并说"GPU同步失败"或"CUDA_ERROR_LAUNCH_FAILED"(请参阅​​下面的日志).使用CPU的训练效果很好,但当然会慢一些.我使用的是两台不同的机器,并且都有相同的问题.我的猜测是,这是与安装/配置有关的问题,但我不知道如何解决.

I'm using Keras to train a convolutional neural network using the fit_generator function as the images are stored in .h5 files and don't fit in memory. Most of the times I'm not able to train the model as it gets stuck in the middle of the first epoch, or it crashes saying 'GPU sync failed' or 'CUDA_ERROR_LAUNCH_FAILED' (see the logs below). The training using the CPUs works well but of course it is slower. I'm using two different machines and both have the same issues. My guess is that it is an installation/configuration related problem but I don't know how to fix it.

在两台机器上都按以下说明安装了Tensorflow: https://www.anaconda.com/blog/developer-blog/tensorflow-in-anaconda/

On both machines Tensorflow was installed as explained here: https://www.anaconda.com/blog/developer-blog/tensorflow-in-anaconda/

我已使用此脚本 https://github.com/tensorflow/tensorflow/blob/master/tools/tf_env_collect.sh 来收集以下信息.

I have used this script https://github.com/tensorflow/tensorflow/blob/master/tools/tf_env_collect.sh to collect the following informations.

在这里tf_env.txt

Here the tf_env.txt

First machine:

Keras 2.2.4. 

== cat /etc/issue ===============================================
Linux liph02.novalocal 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

== are we in docker =============================================
No

== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


== uname -a =====================================================
Linux liph02.novalocal 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy                    1.15.4    
numpydoc                 0.8.0     
protobuf                 3.6.1     
tensorflow               1.12.0    

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.12.0
tf.GIT_VERSION = b'unknown'
tf.COMPILER_VERSION = b'unknown'
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH /usr/local/cuda-9.2/lib64
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Fri Dec 28 16:13:39 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:00:06.0 Off |                  N/A |
| 22%   38C    P0    57W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

== cuda libs  ===================================================
/usr/local/Wolfram/Mathematica/11.3/SystemFiles/Components/MXNetLink/LibraryResources/Linux-x86-64/libcudart.so.9.1


Second machine:

Keras 2.2.4.


== cat /etc/issue ===============================================
Linux liph01.novalocal 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

== are we in docker =============================================
No

== compiler =====================================================
c++ (GCC) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


== uname -a =====================================================
Linux liph01.novalocal 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
msgpack-numpy                      0.4.3.2    
numpy                              1.15.3     
numpydoc                           0.8.0      
protobuf                           3.6.0      
tensorflow                         1.11.0     

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.11.0
tf.GIT_VERSION = b'unknown'
tf.COMPILER_VERSION = b'unknown'

== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Thu Jan  3 17:38:44 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:00:07.0 Off |                  N/A |
| 40%   65C    P2    94W / 250W |  11747MiB / 12196MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     16991      C   python                                     11737MiB |
+-----------------------------------------------------------------------------+

== cuda libs  ===================================================
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2.148
/usr/local/cuda-9.2/doc/man/man7/libcudart.7
/usr/local/cuda-9.2/doc/man/man7/libcudart.so.7

这是两个堆栈跟踪

(dev) -bash-4.2$ python classifier_training.py --dirs /data/simulations/Paranal_gam/ /data/simulations/Paranal_prot/ --epochs 1 --batch_size 32 --workers 16 --model ClassifierV2 --patience 1
Using TensorFlow backend.
ClassifierV2
Building training generator...
Building validation generator...
2018-12-18 12:15:19.553286: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2018-12-18 12:15:20.043811: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-18 12:15:20.047991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:00:06.0
totalMemory: 11.91GiB freeMemory: 11.75GiB
2018-12-18 12:15:20.048093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
Traceback (most recent call last):
  File "classifier_training.py", line 122, in <module>
    model = class_v2.get_model()
  File "/data/ctasoft/cta-lstchain/cnn/classifiers.py", line 40, in get_model
    self.model.add(Conv2D(16, kernel_size=(3, 3), input_shape=(1, self.img_rows, self.img_cols),  data_format='channels_first', activation='relu'))
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/engine/sequential.py", line 165, in add
    layer(x)
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/engine/base_layer.py", line 457, in __call__
    output = self.call(inputs, **kwargs)
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/layers/convolutional.py", line 171, in call
    dilation_rate=self.dilation_rate)
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 3641, in conv2d
    x, tf_data_format = _preprocess_conv2d_input(x, data_format)
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 3521, in _preprocess_conv2d_input
    if not _has_nchw_support() or force_transpose:
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 292, in _has_nchw_support
    gpus_available = len(_get_available_gpus()) > 0
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 278, in _get_available_gpus
    _LOCAL_DEVICES = get_session().list_devices()
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 186, in get_session
    _SESSION = tf.Session(config=config)
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1551, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 676, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: unspecified launch failure


(dev) -bash-4.2$ python classifier_training.py --dirs /data/simulations/Paranal_gam /data/simulations/Paranal_prot --workers 1 --epochs 10 --batch_size 16 --model ClassifierV2 --patience 9
Using TensorFlow backend.
ClassifierV2
Building training generator...
Building validation generator...
2018-12-29 19:29:11.142008: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2018-12-29 19:29:11.892617: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning
 NUMA node zero
2018-12-29 19:29:11.896828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:00:06.0
totalMemory: 11.91GiB freeMemory: 11.75GiB
2018-12-29 19:29:11.896880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-29 19:29:12.960736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-29 19:29:12.960804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2018-12-29 19:29:12.960819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2018-12-29 19:29:12.961681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11366 MB memory) -> physical GPU (device:
0, name: TITAN Xp, pci bus id: 0000:00:06.0, compute capability: 6.1)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 16, 98, 98)        160
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 16, 96, 96)        2320
_________________________________________________________________
average_pooling2d_1 (Average (None, 16, 48, 48)        0
_________________________________________________________________
dropout_1 (Dropout)          (None, 16, 48, 48)        0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 32, 46, 46)        4640
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 32, 44, 44)        9248
_________________________________________________________________
average_pooling2d_2 (Average (None, 32, 22, 22)        0
_________________________________________________________________
dropout_2 (Dropout)          (None, 32, 22, 22)        0
_________________________________________________________________
flatten_1 (Flatten)          (None, 15488)             0
_________________________________________________________________
dense_1 (Dense)              (None, 128)               1982592
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_2 (Dense)              (None, 256)               33024
_________________________________________________________________
dropout_4 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 257
=================================================================
Total params: 2,032,241
Trainable params: 2,032,241
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
   4/8065 [..............................] - ETA: 1:52:06 - loss: 0.9940 - acc: 0.4531 - precision: 0.4947 - recall: 0.71882018-12-29 19:29:54.459471: E tensorflow/stream_executor/cuda/cuda_event.cc:48] E
rror polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2018-12-29 19:29:54.459645: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
Aborted

推荐答案

在第一台计算机上看起来像 CUDA 版本不匹配,请确保使用单个版本的CUDA,在第二台计算机上确保使用 CUDA cuDNN 的设置不正确.请遵循 Tensorflow 中提到的说明,并带有GPU支持.还要检查 NVIDIA 驱动程序的计算能力,并相应地安装 CUDA .

Looks like on the First machine, CUDA version mismatch,Make sure use single version of CUDA and on the second machine the variables of CUDA and cuDNN are not set properly. Follow the instructions mentioned on Tensorflow with GPU support. Also check the NVIDIA driver compute capability and install CUDA accordingly.

这篇关于带有Tensorflow和Keras的CUDA_ERROR_LAUNCH_FAILED的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆