Keras + tensorflow + P100: cudaErrorNotSupported = 71 error

Question

Apologies if this has already been reported elsewhere; I have been looking for it for quite some time, without success.

While running the simple mnist example (available on github /fchollet/keras/blob/master/examples/mnist_cnn.py) with keras + tensorflow on a P100 GPGPU, we encounter an issue at the intersection of keras/tensorflow/cuda:


Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-PCIE-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.3285
pciBusID 0000:02:00.0
Total memory: 15.89GiB
Free memory: 15.51GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:02:00.0)
F tensorflow/core/common_runtime/gpu/gpu_device.cc:121] Check failed: err == cudaSuccess (71 vs. 0)
srun: error: nid02011: task 0: Aborted
srun: Terminating job step 1262138.0

We are using keras 2.0.2, tensorflow 1.0.0, and cuda 8.0.53. We seem to have this issue with both python 2.7.12 and python 3.5.2 (keras 1.2 and 2.0 ...).

Bare tensorflow tests run fine, which leads us to think that the problem really lies at the intersection of keras/tensorflow/cuda.

The same test runs fine on various machines with the same software versions but with TitanX GPGPUs.

It seems to trace back to this CUDA error type:


cudaErrorNotSupported = 71
This error indicates the attempted operation is not supported on the current system or device.
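
To connect the fatal log line to the error type quoted above, the numeric code can be pulled out of the "Check failed" message and looked up by name. The following is a minimal sketch, not part of the original question: the function name `explain_check_failure` and the lookup table are hypothetical, and the table holds only the two codes that appear in this post, using the CUDA 8.0 numbering quoted in the question (other toolkit versions may number errors differently).

```python
import re

# Assumption: numbering as quoted in the question for CUDA 8.0.
# Only the codes that appear in this post are listed here.
CUDA8_ERROR_NAMES = {
    0: "cudaSuccess",
    71: "cudaErrorNotSupported",
}

# Matches the fatal check emitted by TensorFlow's gpu_device.cc in the log above.
CHECK_FAILED = re.compile(r"Check failed: err == cudaSuccess \((\d+) vs\. 0\)")


def explain_check_failure(log_line):
    """Return (code, name) for a 'Check failed' log line, or None if it does not match."""
    match = CHECK_FAILED.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, CUDA8_ERROR_NAMES.get(code, "unknown (not in this table)")


line = ("F tensorflow/core/common_runtime/gpu/gpu_device.cc:121] "
        "Check failed: err == cudaSuccess (71 vs. 0)")
print(explain_check_failure(line))
```

Applied to the log in the question, this reports code 71, which is the `cudaErrorNotSupported` entry quoted above.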

I have no idea where to look next to solve this issue. I would greatly appreciate any feedback and guidance on this matter.

Recommended answer

The underlying source of the problem here appears to be an incompatibility between Tensorflow and the CUDA MPS service (see a related Tensorflow tracker issue). It should only affect clusters and large systems which use the MPS service to improve the granularity of access to GPU devices.

This should probably be raised as a bug with both NVIDIA and the Tensorflow development team.

Edited to add the diagnosis from the Tensorflow tracker issue:

It appears the underlying reason is the extensive use of stream callbacks in Tensorflow, which MPS has not supported before the recent Volta hardware release from NVIDIA. Apparently it is also possible to build Tensorflow from source with options which will make it work correctly with MPS on earlier hardware as well. See the linked tracker discussion for more details.
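
Since the diagnosis hinges on MPS being active, it can help to confirm that an MPS daemon is actually running on the node where the job aborts. The sketch below is an illustration under stated assumptions, not something from the tracker discussion: it assumes a Linux node with a readable `/proc`, it assumes the daemons run under NVIDIA's usual binary names `nvidia-cuda-mps-control` and `nvidia-cuda-mps-server`, and the helper names `list_process_names` and `mps_processes` are hypothetical.

```python
import os

# Assumption: the MPS control daemon and server run under these binary names.
MPS_PROCESS_NAMES = {"nvidia-cuda-mps-control", "nvidia-cuda-mps-server"}


def list_process_names(proc_root="/proc"):
    """Collect executable names of running processes from /proc/<pid>/cmdline."""
    names = []
    if not os.path.isdir(proc_root):
        return names  # not a Linux-style /proc; nothing to inspect
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as handle:
                first_arg = handle.read().split(b"\0")[0]
        except OSError:
            continue  # process exited or is not readable by this user
        if first_arg:
            names.append(os.path.basename(first_arg.decode(errors="replace")))
    return names


def mps_processes(process_names):
    """Return the sorted MPS daemon names found among the given process names."""
    return sorted(set(process_names) & MPS_PROCESS_NAMES)


found = mps_processes(list_process_names())
if found:
    print("MPS appears to be running:", ", ".join(found))
else:
    print("No MPS daemon found among visible processes.")
```

On a cluster this needs to run inside the job allocation (for example through the same `srun` invocation), because the daemon lives on the compute node rather than the login node. An empty result only means that no such process was visible to the current user.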

[This answer was assembled from comments and added as a community wiki entry in order to get it off the unanswered list for the CUDA tag]
