How does one have TensorFlow not run the script unless the GPU was loaded successfully?


Question

I have been trying to run TensorFlow training on a machine with GPUs. However, whenever I do, I get some kind of error that seems to say it could not use the GPU for some reason (usually a memory issue, a CUDA issue, a cuDNN issue, etc.). Because TensorFlow automatically falls back to the CPU when it cannot use the GPU, it has been hard for me to tell whether it was actually able to use the GPU. I would therefore like my script to fail/halt unless the GPU is being used. How do I do that?

For the sake of an example, I currently get the following messages:

I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:85:00.0
Total memory: 15.93GiB
Free memory: 15.63GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:85:00.0
Total memory: 15.93GiB
Free memory: 522.25MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:532] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

It seems to load all the CUDA libraries fine, but then complains at the end. The complaining lines are:

E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:532] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

We could try to debug these specific errors, but for the moment it proceeds to train, and I have no idea whether it is using the CPU or the GPU. Can we just have it not proceed with training if any weird CUDA/cuDNN or other GPU error comes up?

Recommended answer

Use with tf.device('/gpu:0'): to pin your ops to the GPU. This will kill your program if /gpu:0 doesn't exist, as long as soft placement is not enabled (allow_soft_placement is off by default in tf.ConfigProto).

For an example, see https://github.com/hughperkins/tensorflow-cl/blob/tensorflow-cl/tensorflow/stream_executor/cl/test/test_binary_ops.py#L52

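import tensorflow as tf  # TensorFlow 1.x API (tf.Session, tf.placeholder)

# tf_dtype and tf_func are supplied by the linked test; they are not defined in this excerpt.
with tf.Graph().as_default():
    with tf.Session(config=tf.ConfigProto(log_device_placement=False)) as sess:
        # Pin the ops created below to the first GPU.
        with tf.device('/gpu:0'):
            tf_a = tf.placeholder(tf_dtype, [None, None], 'a')
            tf_b = tf.placeholder(tf_dtype, [None, None], 'b')
            # Look up the op function by name in the tf module.
            tf_c = tf.__dict__[tf_func](tf_a, tf_b, name="c")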
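
As a complementary idea, the script can also refuse to start when TensorFlow reports no GPU at all. The following is only a minimal sketch, not part of the original answer: it assumes TensorFlow 2.1 or newer (where tf.config.list_physical_devices is available), and the helper names list_tf_gpus and require_gpu are made up for illustration.

```python
def list_tf_gpus():
    """Return the GPUs TensorFlow reports (assumption: TensorFlow 2.1+)."""
    # Imported lazily so that require_gpu() can be exercised without TensorFlow.
    import tensorflow as tf
    return tf.config.list_physical_devices("GPU")


def require_gpu(list_gpus=list_tf_gpus):
    """Raise RuntimeError unless at least one GPU is reported.

    list_gpus is a hypothetical hook: any callable returning a list of devices.
    """
    gpus = list_gpus()
    if not gpus:
        raise RuntimeError("No GPU visible to TensorFlow; refusing to run on CPU.")
    return gpus


# Intended use: call require_gpu() once at the top of the training script,
# before building the model, so the script stops instead of training on CPU.
```

Note that this only checks whether a GPU is visible when the script starts. It would not by itself catch failures that happen later, such as the cuDNN handle errors in the log above.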
