CUDA_ERROR_LAUNCH_FAILED with Tensorflow and Keras

Published: 2021/5/13 18:43:07 Tags: python tensorflow keras gpu


Problem description

I'm using Keras to train a convolutional neural network with the fit_generator function, because the images are stored in .h5 files and don't fit in memory. Most of the time I'm not able to train the model: it either gets stuck in the middle of the first epoch, or it crashes with 'GPU sync failed' or 'CUDA_ERROR_LAUNCH_FAILED' (see the logs below). Training on the CPU works well, but of course it is slower. I'm using two different machines and both have the same issue. My guess is that it is an installation/configuration problem, but I don't know how to fix it.
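
For readers unfamiliar with the setup, here is a minimal sketch of the kind of batch generator that fit_generator consumes. It is not the code from classifier_training.py, which is not shown in the question. The names batch_generator, steps_per_epoch and load_batch are hypothetical, and plain Python lists stand in for the .h5 datasets so the sketch runs on its own; in a real script load_batch would slice datasets of an open .h5 file.

```python
import itertools


def batch_generator(load_batch, n_samples, batch_size):
    """Yield (inputs, targets) batches forever, as fit_generator expects.

    load_batch(start, stop) is a hypothetical callable returning the inputs
    and targets for samples start..stop-1, e.g. by slicing two datasets of
    an open .h5 file so that only one batch is held in memory at a time.
    """
    while True:  # Keras generators are expected to loop indefinitely
        for start in range(0, n_samples, batch_size):
            stop = min(start + batch_size, n_samples)
            yield load_batch(start, stop)


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once (ceiling division)."""
    return -(-n_samples // batch_size)


# Stand-in data so the sketch runs without an .h5 file (assumption).
images = list(range(10))
labels = [i % 2 for i in images]


def load_from_lists(start, stop):
    return images[start:stop], labels[start:stop]


gen = batch_generator(load_from_lists, n_samples=len(images), batch_size=4)
n_steps = steps_per_epoch(len(images), 4)
first_epoch = list(itertools.islice(gen, n_steps))
print(n_steps, first_epoch[-1])
```

The generator and the step count would then be passed to fit_generator as its generator and steps_per_epoch arguments. Because such a generator only touches host memory, it works the same whether the model runs on the CPU or the GPU.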

On both machines Tensorflow was installed as explained here: https://www.anaconda.com/blog/developer-blog/tensorflow-in-anaconda/

I used this script, https://github.com/tensorflow/tensorflow/blob/master/tools/tf_env_collect.sh, to collect the following information.

Here is the tf_env.txt output:

First machine:

Keras 2.2.4. 

== cat /etc/issue ===============================================
Linux liph02.novalocal 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

== are we in docker =============================================
No

== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


== uname -a =====================================================
Linux liph02.novalocal 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy                    1.15.4    
numpydoc                 0.8.0     
protobuf                 3.6.1     
tensorflow               1.12.0    

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.12.0
tf.GIT_VERSION = b'unknown'
tf.COMPILER_VERSION = b'unknown'
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH /usr/local/cuda-9.2/lib64
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Fri Dec 28 16:13:39 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:00:06.0 Off |                  N/A |
| 22%   38C    P0    57W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

== cuda libs  ===================================================
/usr/local/Wolfram/Mathematica/11.3/SystemFiles/Components/MXNetLink/LibraryResources/Linux-x86-64/libcudart.so.9.1


Second machine:

Keras 2.2.4.


== cat /etc/issue ===============================================
Linux liph01.novalocal 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

== are we in docker =============================================
No

== compiler =====================================================
c++ (GCC) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


== uname -a =====================================================
Linux liph01.novalocal 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
msgpack-numpy                      0.4.3.2    
numpy                              1.15.3     
numpydoc                           0.8.0      
protobuf                           3.6.0      
tensorflow                         1.11.0     

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.11.0
tf.GIT_VERSION = b'unknown'
tf.COMPILER_VERSION = b'unknown'

== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Thu Jan  3 17:38:44 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:00:07.0 Off |                  N/A |
| 40%   65C    P2    94W / 250W |  11747MiB / 12196MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     16991      C   python                                     11737MiB |
+-----------------------------------------------------------------------------+

== cuda libs  ===================================================
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2.148
/usr/local/cuda-9.2/doc/man/man7/libcudart.7
/usr/local/cuda-9.2/doc/man/man7/libcudart.so.7
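
The "cuda libs" sections of the two reports list different CUDA runtime libraries: 9.1 (bundled with Mathematica) on the first machine and 9.2.148 on the second. A small sketch like the one below can summarize the runtime versions in such a list. The function name cudart_versions is hypothetical, the paths are copied from the reports above, and whether these versions have anything to do with the error is not established here.

```python
import re

# Matches the version suffix of a shared CUDA runtime library, e.g.
# "libcudart.so.9.2.148" -> "9.2.148". At least two numeric components are
# required, so static archives (libcudart_static.a) and man pages
# (libcudart.7, libcudart.so.7) are skipped.
_CUDART = re.compile(r"/libcudart\.so\.(\d+(?:\.\d+)+)$")


def cudart_versions(paths):
    """Return the sorted, de-duplicated CUDA runtime versions in the paths."""
    found = set()
    for path in paths:
        match = _CUDART.search(path.strip())
        if match:
            found.add(match.group(1))
    return sorted(found)


first_machine = [
    "/usr/local/Wolfram/Mathematica/11.3/SystemFiles/Components/MXNetLink/"
    "LibraryResources/Linux-x86-64/libcudart.so.9.1",
]
second_machine = [
    "/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart_static.a",
    "/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2.148",
    "/usr/local/cuda-9.2/doc/man/man7/libcudart.7",
    "/usr/local/cuda-9.2/doc/man/man7/libcudart.so.7",
]

print(cudart_versions(first_machine))   # ['9.1']
print(cudart_versions(second_machine))  # ['9.2.148']
```

Note that the "CUDA Version: 10.0" field in the first machine's nvidia-smi header reports what the installed driver supports, not which runtime library a given program loads.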

Here are the two stack traces:

(dev) -bash-4.2$ python classifier_training.py --dirs /data/simulations/Paranal_gam/ /data/simulations/Paranal_prot/ --epochs 1 --batch_size 32 --workers 16 --model ClassifierV2 --patience 1
Using TensorFlow backend.
ClassifierV2
Building training generator...
Building validation generator...
2018-12-18 12:15:19.553286: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2018-12-18 12:15:20.043811: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-18 12:15:20.047991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:00:06.0
totalMemory: 11.91GiB freeMemory: 11.75GiB
2018-12-18 12:15:20.048093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
Traceback (most recent call last):
  File "classifier_training.py", line 122, in <module>
    model = class_v2.get_model()
  File "/data/ctasoft/cta-lstchain/cnn/classifiers.py", line 40, in get_model
    self.model.add(Conv2D(16, kernel_size=(3, 3), input_shape=(1, self.img_rows, self.img_cols),  data_format='channels_first', activation='relu'))
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/engine/sequential.py", line 165, in add
    layer(x)
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/engine/base_layer.py", line 457, in __call__
    output = self.call(inputs, **kwargs)
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/layers/convolutional.py", line 171, in call
    dilation_rate=self.dilation_rate)
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 3641, in conv2d
    x, tf_data_format = _preprocess_conv2d_input(x, data_format)
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 3521, in _preprocess_conv2d_input
    if not _has_nchw_support() or force_transpose:
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 292, in _has_nchw_support
    gpus_available = len(_get_available_gpus()) > 0
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 278, in _get_available_gpus
    _LOCAL_DEVICES = get_session().list_devices()
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 186, in get_session
    _SESSION = tf.Session(config=config)
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1551, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 676, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: unspecified launch failure


(dev) -bash-4.2$ python classifier_training.py --dirs /data/simulations/Paranal_gam /data/simulations/Paranal_prot --workers 1 --epochs 10 --batch_size 16 --model ClassifierV2 --patience 9
Using TensorFlow backend.
ClassifierV2
Building training generator...
Building validation generator...
2018-12-29 19:29:11.142008: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2018-12-29 19:29:11.892617: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-29 19:29:11.896828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:00:06.0
totalMemory: 11.91GiB freeMemory: 11.75GiB
2018-12-29 19:29:11.896880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-29 19:29:12.960736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-29 19:29:12.960804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2018-12-29 19:29:12.960819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2018-12-29 19:29:12.961681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11366 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:00:06.0, compute capability: 6.1)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 16, 98, 98)        160
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 16, 96, 96)        2320
_________________________________________________________________
average_pooling2d_1 (Average (None, 16, 48, 48)        0
_________________________________________________________________
dropout_1 (Dropout)          (None, 16, 48, 48)        0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 32, 46, 46)        4640
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 32, 44, 44)        9248
_________________________________________________________________
average_pooling2d_2 (Average (None, 32, 22, 22)        0
_________________________________________________________________
dropout_2 (Dropout)          (None, 32, 22, 22)        0
_________________________________________________________________
flatten_1 (Flatten)          (None, 15488)             0
_________________________________________________________________
dense_1 (Dense)              (None, 128)               1982592
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_2 (Dense)              (None, 256)               33024
_________________________________________________________________
dropout_4 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 257
=================================================================
Total params: 2,032,241
Trainable params: 2,032,241
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
   4/8065 [..............................] - ETA: 1:52:06 - loss: 0.9940 - acc: 0.4531 - precision: 0.4947 - recall: 0.7188
2018-12-29 19:29:54.459471: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2018-12-29 19:29:54.459645: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
Aborted
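
In logs like the ones above, the relevant records are easy to miss among the informational lines. The sketch below pulls out the error (E) and fatal (F) records, assuming the "timestamp: severity file:line] message" layout seen in this output. The function name problems is hypothetical and the sample text is copied from the second log. Searching the whole text instead of splitting it into lines also copes with a record that starts in the middle of a line, as can happen when it is printed right after a Keras progress bar.

```python
import re

# Timestamp, severity letter (I/W/E/F), source file, line number, message.
_TF_LOG = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): ([IWEF]) (\S+?):(\d+)\] ?(.*)"
)


def problems(log_text):
    """Return (severity, source, line, message) for each E or F record."""
    records = []
    for match in _TF_LOG.finditer(log_text):
        _, severity, source, line, message = match.groups()
        if severity in "EF":
            records.append((severity, source, int(line), message.strip()))
    return records


sample = (
    "2018-12-29 19:29:11.896880: I tensorflow/core/common_runtime/gpu/"
    "gpu_device.cc:1511] Adding visible gpu devices: 0\n"
    "   4/8065 [..............................] - ETA: 1:52:06 - loss: 0.9940"
    " - acc: 0.4531 - precision: 0.4947 - recall: 0.7188"
    "2018-12-29 19:29:54.459471: E tensorflow/stream_executor/cuda/"
    "cuda_event.cc:48] Error polling for event status: failed to query event: "
    "CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure\n"
    "2018-12-29 19:29:54.459645: F tensorflow/core/common_runtime/gpu/"
    "gpu_event_mgr.cc:274] Unexpected Event status: 1\n"
)

for record in problems(sample):
    print(record)
```

For the sample this prints two records: the E record from cuda_event.cc:48 carrying CUDA_ERROR_LAUNCH_FAILED, and the F record from gpu_event_mgr.cc:274 that precedes the abort.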
