Stopping and starting a deep learning Google Cloud VM instance causes TensorFlow to stop recognizing the GPU


Question

I am using one of the pre-built Deep Learning VM instances offered by Google Cloud, with an NVIDIA Tesla K80 GPU attached. I chose to have TensorFlow 2.5 and CUDA 11.0 installed automatically. When I first start the instance, everything works as expected. I can run:

import tensorflow as tf
tf.config.list_physical_devices()

The call returns the CPU, the accelerated CPU, and the GPU. Similarly, tf.test.is_gpu_available() returns True.

However, if I log out, stop the instance, and then start it again, the same code sees only the CPU, and tf.test.is_gpu_available() returns False. I also get an error suggesting that driver initialization is failing:

 E tensorflow/stream_executor/cuda/cuda_driver.cc:355] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

Running nvidia-smi shows that the machine still sees the GPU, but TensorFlow cannot see it.
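The mismatch described above (the driver sees the GPU, TensorFlow does not) can be checked in one step after each restart. The following is only a minimal sketch: the helper names `driver_sees_gpu`, `tensorflow_gpus`, and `diagnose` are hypothetical, and it assumes `nvidia-smi` is on the PATH. It uses `tf.config.list_physical_devices("GPU")`, which is the documented replacement for the deprecated `tf.test.is_gpu_available()`.

```python
import shutil
import subprocess


def driver_sees_gpu():
    """Return True if `nvidia-smi -L` succeeds and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


def tensorflow_gpus():
    """Return the GPU names TensorFlow sees, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


def diagnose(driver_ok, tf_gpus):
    """Turn the two observations into a short, human-readable summary."""
    if tf_gpus is None:
        return "TensorFlow is not installed in this environment."
    if tf_gpus:
        return "TensorFlow sees the GPU: " + ", ".join(tf_gpus)
    if driver_ok:
        # This is the situation in the question: driver fine, CUDA init failing.
        return "nvidia-smi lists a GPU but TensorFlow does not."
    return "Neither nvidia-smi nor TensorFlow sees a GPU."
```

Calling `print(diagnose(driver_sees_gpu(), tensorflow_gpus()))` right after the instance comes back up shows which of the cases applies, without having to read through the TensorFlow log output.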

Does anyone know what could be causing this? I don't want to have to reinstall everything each time I restart the instance.

Recommended answer

Some people (sadly not me) are able to resolve this by setting the following at the beginning of their script or main function:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
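CUDA reads this variable when it is initialized, so the assignment most likely needs to run before TensorFlow touches the GPU; placing it before the TensorFlow import is the safe choice. A small sketch of that ordering, where the helper name `select_gpu` is hypothetical and not part of any library:

```python
import os
import sys


def select_gpu(device_ids="0"):
    """Expose only the given GPU index(es) to CUDA for this process.

    Returns True if the variable was set before TensorFlow was imported,
    and False if TensorFlow was already loaded, in which case the setting
    may come too late to have any effect.
    """
    already_imported = "tensorflow" in sys.modules
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    return not already_imported
```

In a script this would be called as `select_gpu("0")` at the very top, followed by `import tensorflow as tf`. A return value of False is a hint to move the call earlier.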

I had to reinstall the CUDA drivers, and from then on the GPU was recognized even after restarting the instance. On NVIDIA's website you can select your system configuration, and it will give you the commands needed to install CUDA. You will also be asked whether you want to uninstall the previous CUDA version (yes!). Fortunately, the whole process is very fast.
