NVidia drivers not running on AWS after restarting the AMI

Question

Hi everybody, I have the following problem:

I started a P2 instance with this AMI and installed some tools such as screen and torch. I then successfully ran some experiments on the GPU and created an image of the instance, so that I could terminate it and run it again later.

Later I started a new instance from the AMI I created before. Everything looked fine - screen, torch, my experiments were present on the system, but I couldn't run the same experiments as before:


NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

To me it looks like the drivers are probably installed (all the other tools I installed earlier are still there) but are not running. Is that a correct assumption? How can I start them?

Recommended answer

We had this problem recently. In our case, it seems that the default kernel on the AWS instance had been upgraded (from 4.4.0-1049-aws to 4.4.0-1061-aws), but the new kernel did not have the nvidia modules installed:

ubuntu@ip-XXX-XXX-XXX-XXX:~$ ls -laR /lib/modules/4.4.0-1061-aws | grep -i nvidia
ubuntu@ip-XXX-XXX-XXX-XXX:~$ ls -laR /lib/modules/4.4.0-1049-aws | grep -i nvidia
-rw-r--r--  1 root root    87368 Jun 27 10:21 nvidia-drm.ko
-rw-r--r--  1 root root  1155304 Jun 27 10:21 nvidia-modeset.ko
-rw-r--r--  1 root root  1163016 Jun 27 10:21 nvidia-uvm.ko
-rw-r--r--  1 root root 18014088 Jun 27 10:21 nvidia.ko
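The same comparison can be scripted. The following is a minimal sketch, not part of the original answer: the function names `has_nvidia_module` and `report_kernels` and the `MODULES_ROOT` override are made up for illustration, and the file pattern `nvidia*.ko*` is an assumption about how the modules are named on your system (it matches the files listed above).

```shell
#!/bin/sh
# Sketch: report which installed kernels have an nvidia kernel module on disk.
# Helper names and MODULES_ROOT are hypothetical, chosen for this example.

has_nvidia_module() {
    # $1 = kernel release (as printed by `uname -r`), $2 = modules root
    mod_root="${2:-/lib/modules}"
    [ -d "$mod_root/$1" ] || return 1
    # Assumption: module files are named nvidia*.ko (possibly compressed).
    [ -n "$(find "$mod_root/$1" -name 'nvidia*.ko*' -print 2>/dev/null | head -n 1)" ]
}

report_kernels() {
    # $1 = modules root; prints one status line per installed kernel
    list_root="${1:-/lib/modules}"
    for dir in "$list_root"/*/; do
        [ -d "$dir" ] || continue
        release=$(basename "$dir")
        if has_nvidia_module "$release" "$list_root"; then
            echo "$release: nvidia module present"
        else
            echo "$release: nvidia module MISSING"
        fi
    done
}

echo "running kernel: $(uname -r)"
report_kernels "${MODULES_ROOT:-/lib/modules}"
```

If the running kernel is reported as MISSING while an older one is reported as present, you are probably in the situation described here.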

Check your kernel version (uname -a) to see if this is the case for you. Our GRUB configuration still allowed booting the old kernel image (1049), but by default it was loading the new one (1061). The relevant portion of /boot/grub/grub.cfg:

ubuntu@ip-XXX-XXX-XXX-XXX:~$ grep -i -e "ubuntu, with linux" /boot/grub/grub.cfg
    menuentry 'Ubuntu, with Linux 4.4.0-1061-aws' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.4.0-1061-aws-advanced-XXXX' {
    menuentry 'Ubuntu, with Linux 4.4.0-1061-aws (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.4.0-1061-aws-recovery-XXXX' {
    menuentry 'Ubuntu, with Linux 4.4.0-1049-aws' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.4.0-1049-aws-advanced-XXXX' {
    menuentry 'Ubuntu, with Linux 4.4.0-1049-aws (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.4.0-1049-aws-recovery-XXXX' {

You can force the next reboot to load the old kernel by using grub-reboot:

sudo /usr/sbin/grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 4.4.0-1049-aws"
sudo reboot

This will boot the instance with the old kernel, for which the nvidia modules are installed.
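To avoid typing the menu entry by hand, the title can be looked up in grub.cfg first. This is only a sketch under stated assumptions: `grub_entry_for_kernel` is a hypothetical helper, and it assumes the kernel entries sit under a submenu titled "Advanced options for Ubuntu", as in the command above. Check your own grub.cfg before relying on it.

```shell
#!/bin/sh
# Sketch: build the "submenu>entry" string that grub-reboot expects.
# grub_entry_for_kernel is a hypothetical helper written for this example.

grub_entry_for_kernel() {
    # $1 = kernel release, $2 = path to grub.cfg
    cfg="${2:-/boot/grub/grub.cfg}"
    # Pull out every menuentry title, then keep the exact non-recovery entry.
    title=$(sed -n "s/^[[:space:]]*menuentry '\([^']*\)'.*/\1/p" "$cfg" 2>/dev/null \
        | grep -F -x "Ubuntu, with Linux $1" | head -n 1)
    [ -n "$title" ] || return 1
    # Assumption: the submenu is named as in the answer above.
    printf '%s>%s\n' "Advanced options for Ubuntu" "$title"
}

# Example usage (not executed here):
#   entry=$(grub_entry_for_kernel 4.4.0-1049-aws) &&
#       sudo /usr/sbin/grub-reboot "$entry" && sudo reboot
```

The function returns a non-zero status when no matching entry exists, so the reboot in the usage example is skipped rather than attempted with a wrong entry.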
