Could not load dynamic library libcuda.so.1 error on Google AI Platform with custom container


Problem description


I'm trying to launch a training job on Google AI Platform with a custom container. As I want to use GPUs for the training, the base image I've used for my container is:

FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu18.04

With this image (and TensorFlow 2.4.1 installed on top of it), I thought I could use the GPUs on AI Platform, but that does not seem to be the case. When training starts, the logs show the following:

W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (gke-cml-0309-144111--n1-highmem-8-43e-0b9fbbdc-gnq6): /proc/driver/nvidia/version does not exist
I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
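
(For reference, a quick way to confirm the symptom from inside the container is a check along these lines, assuming python3 and TensorFlow are on the path:)

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# expected to print an empty list here, consistent with the missing libcuda.so.1 above;
# nvidia-smi would likewise fail, since no kernel driver is visible inside the container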

Is this a good way to build an image that uses GPUs on Google AI Platform? Or should I instead rely on a TensorFlow image and manually install all the drivers needed to use the GPUs?

EDIT: I read here (https://cloud.google.com/ai-platform/training/docs/containers-overview) the following:

For training with GPUs, your custom container needs to meet a few special requirements. You must build a different Docker image than what you'd use for training with CPUs.

Pre-install the CUDA toolkit and cuDNN in your Docker image. Using the nvidia/cuda image as your base image is the recommended way to handle this. It has the matching versions of CUDA toolkit and cuDNN pre-installed, and it helps you set up the related environment variables correctly.

Install your training application, along with your required ML framework and other dependencies in your Docker image.

They also give a Dockerfile example there for training with GPUs, so what I did seems OK (my image is built roughly along the lines of the sketch below). Unfortunately, I still get the errors mentioned above, which may (or may not) explain why I cannot use GPUs on Google AI Platform.
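
(This is a simplified sketch of the setup described above, not the exact example from the documentation; the trainer package and module name are placeholders:)

FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu18.04

# Python and pip (the runtime base image does not ship them)
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# TensorFlow 2.4.1 installed on top of the CUDA 11.1 / cuDNN 8 runtime
RUN pip3 install --no-cache-dir --upgrade pip setuptools && \
    pip3 install --no-cache-dir tensorflow==2.4.1

# Training code (paths and module name are placeholders)
WORKDIR /root
COPY trainer/ /root/trainer/
ENTRYPOINT ["python3", "-m", "trainer.task"]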

EDIT 2: Following the instructions here (https://www.tensorflow.org/install/gpu), my Dockerfile is now:

FROM tensorflow/tensorflow:2.4.1-gpu
RUN apt-get update && apt-get install -y \
 lsb-release \
 vim \
 curl \
 git \
 libgl1-mesa-dev \
 software-properties-common \
 wget && \
 rm -rf /var/lib/apt/lists/*

# Add NVIDIA package repositories
RUN wget -nv https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
RUN mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
RUN add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
RUN apt-get update

RUN wget -nv http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

RUN apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
RUN apt-get update

# Install NVIDIA driver
RUN apt-get install -y --no-install-recommends nvidia-driver-450
# Reboot. Check that GPUs are visible using the command: nvidia-smi

RUN wget -nv https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt-get update

# Install development and runtime libraries (~4GB)
RUN apt-get install --no-install-recommends \
    cuda-11-0 \
    libcudnn8=8.0.4.30-1+cuda11.0  \
    libcudnn8-dev=8.0.4.30-1+cuda11.0


# other stuff

The problem is that the build now freezes at what appears to be a keyboard-configuration step: the system asks me to select a country, and when I enter the number, nothing happens.
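
(As an aside, a common workaround for interactive apt prompts during Docker builds is to disable them for the build, e.g. as sketched below, although this on its own would not explain or fix the missing libcuda.so.1:)

# Disable interactive debconf prompts (e.g. keyboard-configuration) for the build only
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends nvidia-driver-450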

Solution

The suggested way to build the most reliable container is to use the officially maintained 'Deep Learning Containers'. I would suggest pulling 'gcr.io/deeplearning-platform-release/tf2-gpu.2-4'. This should already have CUDA, cuDNN, the GPU drivers, and TF 2.4 installed and tested. You'll just need to add your code to it.
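
For example, a Dockerfile built on that image could look roughly like this (a minimal sketch; the trainer package and module name are placeholders for your own code):

FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-4

# Add your training code on top of the prebuilt TF 2.4 GPU environment
WORKDIR /root
COPY trainer/ /root/trainer/

# Module that AI Platform runs as the training job's entry point
ENTRYPOINT ["python", "-m", "trainer.task"]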
