tensorflow on GPU: no known devices, despite cuda's deviceQuery returning a "PASS" result


Problem description


Note: this question was initially asked on github, but I was asked to post it here instead.

I'm having trouble running tensorflow on gpu, and it does not seem to be the usual cuda configuration problem, because everything seems to indicate cuda is properly set up.

The main symptom: when running tensorflow, my gpu is not detected (the code being run, and its output).
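The code and output referred to above were linked as gists; as a rough stand-in, a minimal device-listing check along the following lines (a sketch assuming the TF 1.x API, where device_lib was the usual way to enumerate devices) shows the same symptom, with only the CPU appearing:

from tensorflow.python.client import device_lib

# Enumerate every device TensorFlow itself can see; a working GPU setup
# would list a /gpu:0 entry alongside /cpu:0.
for d in device_lib.list_local_devices():
    print(d.name, d.device_type)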

What differs from usual issues is that cuda seems properly installed and running ./deviceQuery from cuda samples is successful (output).

I have two graphics cards:

  • an old GTX 650 used for my monitors (I don't want to use that one with tensorflow)
  • a GTX 1060 that I want to dedicate to tensorflow

I use:

I've tried:

  • adding /usr/local/cuda/bin/ to $PATH
  • forcing gpu placement in the tensorflow script using with tf.device('/gpu:1'): (and with tf.device('/gpu:0'): when that failed, for good measure)
  • whitelisting the gpu I want to use with CUDA_VISIBLE_DEVICES, in case the presence of my old unsupported card was causing problems (see the sketch after this list)
  • running the script with sudo (because why not)
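For reference, a minimal sketch of the placement attempts above, assuming the TF 1.x graph/session API and that the GTX 1060 is device index 0 as reported by nvidia-smi (an assumption, not taken from the original script):

import os
# Hide the old GTX 650 before TensorFlow initializes CUDA; "0" is assumed to be
# the GTX 1060's index as reported by nvidia-smi.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

# Pin the ops to the first visible GPU explicitly.
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0])
    b = tf.constant([4.0, 5.0, 6.0])
    c = a * b

# log_device_placement prints the device each op actually ran on.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))

On a build with no visible GPU, the explicit /gpu:0 pin would normally raise a "Cannot assign a device" error unless allow_soft_placement=True is set, which is consistent with the symptom described above.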

Here are the outputs of nvidia-smi and nvidia-debugdump -l, in case it's useful.

At this point, I feel like I have followed all the breadcrumbs and have no idea what else I could try. I'm not even sure whether I'm looking at a bug or a configuration problem. Any advice about how to debug this would be greatly appreciated. Thanks!

Update: with the help of Yaroslav on github, I gathered more debugging info by raising the log level, but it doesn't seem to say much about the device selection: https://gist.github.com/oelmekki/760a37ca50bf58d4f03f46d104b798bb

Update 2: Theano detects the gpu correctly, but interestingly it complains about cuDNN being too recent, then falls back to cpu (code ran, output). Maybe that could be the problem with tensorflow as well?

Solution

From the log output, it looks like you are running the CPU version of TensorFlow (PyPI: tensorflow), and not the GPU version (PyPI: tensorflow-gpu). Running the GPU version would either log information about the CUDA libraries, or an error if it failed to load them or open the driver.
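One quick way to confirm which build is installed (a hedged sketch; tf.test.is_built_with_cuda() is assumed to be available in the installed version):

import tensorflow as tf

print(tf.__version__)
# False for the plain "tensorflow" wheel, True for "tensorflow-gpu".
print(tf.test.is_built_with_cuda())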

If you run the following commands, you should be able to use the GPU in subsequent runs:

$ pip uninstall tensorflow
$ pip install tensorflow-gpu
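After reinstalling, one way to verify the switch worked (again a sketch, assuming tf.test.gpu_device_name() exists in the installed version):

import tensorflow as tf

# Returns e.g. "/gpu:0" when a CUDA-enabled build can open the driver,
# or an empty string if no GPU is visible.
print(tf.test.gpu_device_name() or "no GPU detected")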
