Could not load dynamic library libcuda.so.1 error on Google AI Platform with custom container
I'm trying to launch a training job on Google AI Platform with a custom container. As I want to use GPUs for the training, the base image I've used for my container is:
FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu18.04
With this image (and tensorflow 2.4.1 installed on top of it) I thought I could use the GPUs on AI Platform, but that does not seem to be the case. When training starts, the logs show the following:
W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (gke-cml-0309-144111--n1-highmem-8-43e-0b9fbbdc-gnq6): /proc/driver/nvidia/version does not exist
I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
Is this a good way to build an image to use GPUs on Google AI Platform? Or should I instead rely on a tensorflow image and manually install all the drivers needed to use the GPUs?
EDIT: I read here (https://cloud.google.com/ai-platform/training/docs/containers-overview) the following:
For training with GPUs, your custom container needs to meet a few
special requirements. You must build a different Docker image than
what you'd use for training with CPUs.
Pre-install the CUDA toolkit and cuDNN in your Docker image. Using the
nvidia/cuda image as your base image is the recommended way to handle
this. It has the matching versions of CUDA toolkit and cuDNN pre-
installed, and it helps you set up the related environment variables
correctly.
Install your training application, along with your required ML
framework and other dependencies in your Docker image.
They also give a Dockerfile example there for training with GPUs, so what I did seems fine. Unfortunately, I still get the errors mentioned above, which may (or may not) explain why I cannot use GPUs on Google AI Platform.
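Following the documentation's recommendation (nvidia/cuda base image, then the ML framework and the training application on top), a minimal sketch of such a Dockerfile might look like the one below. The `trainer/` package layout, the entrypoint module, and the pinned tensorflow version are my own illustrative assumptions, not from the docs:

```dockerfile
# Base image with matching CUDA toolkit and cuDNN, as the docs recommend
FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu18.04

# The runtime image ships without Python; install it plus pip
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install the ML framework, then the training application
RUN pip3 install --no-cache-dir tensorflow==2.4.1
COPY trainer/ /app/trainer/
WORKDIR /app

ENTRYPOINT ["python3", "-m", "trainer.task"]
```

One detail worth noting: `libcuda.so.1` belongs to the NVIDIA *driver*, which lives on the host and is mounted into the container at runtime, so it should never be baked into the image. If the library is missing at runtime, the job is most likely running on a node without GPUs attached (or without an accelerator requested in the job config), rather than the image being wrong.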
EDIT2: As read here (https://www.tensorflow.org/install/gpu) my Dockerfile is now:
FROM tensorflow/tensorflow:2.4.1-gpu
RUN apt-get update && apt-get install -y \
        lsb-release \
        vim \
        curl \
        git \
        libgl1-mesa-dev \
        software-properties-common \
        wget && \
    rm -rf /var/lib/apt/lists/*
# Add NVIDIA package repositories
RUN wget -nv https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
RUN mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
RUN add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
RUN apt-get update
RUN wget -nv http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
RUN apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
RUN apt-get update
# Install NVIDIA driver
RUN apt-get install -y --no-install-recommends nvidia-driver-450
# Reboot. Check that GPUs are visible using the command: nvidia-smi
RUN wget -nv https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt-get update
# Install development and runtime libraries (~4GB)
RUN apt-get install --no-install-recommends \
        cuda-11-0 \
        libcudnn8=8.0.4.30-1+cuda11.0 \
        libcudnn8-dev=8.0.4.30-1+cuda11.0
# other stuff
The problem is that the build freezes at what appears to be a keyboard-configuration step: the system asks me to select a country, and when I enter the number, nothing happens.
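That kind of hang is what debconf does when apt runs interactively inside a build, where no terminal is attached to answer the prompt. A common workaround (my suggestion, not part of the original post) is to force the noninteractive frontend for the build and pass `-y` to every install, for example:

```dockerfile
# Suppress interactive debconf prompts (e.g. keyboard-configuration)
# during the build; ARG keeps the setting out of the final image
ARG DEBIAN_FRONTEND=noninteractive

# Every apt invocation also needs -y so it never waits for confirmation
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        cuda-11-0 \
        libcudnn8=8.0.4.30-1+cuda11.0 \
        libcudnn8-dev=8.0.4.30-1+cuda11.0
```

The same applies to the `apt install ./*.deb` lines above, which also run without `-y`.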
The suggested way to build the most reliable container is to use the officially maintained 'Deep Learning Containers'. I would suggest pulling 'gcr.io/deeplearning-platform-release/tf2-gpu.2-4'. This should already have CUDA, CUDNN, GPU Drivers, and TF 2.4 installed & tested. You'll just need to add your code into it.
- https://cloud.google.com/ai-platform/deep-learning-containers/docs/choosing-container
- https://console.cloud.google.com/gcr/images/deeplearning-platform-release?project=deeplearning-platform-release
- https://cloud.google.com/ai-platform/deep-learning-containers/docs/getting-started-local#create_your_container
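Building on the answer's suggestion, adding your code on top of the maintained container could be as simple as the sketch below. The `trainer/` directory, `requirements.txt`, and entrypoint module are illustrative assumptions about a typical project layout:

```dockerfile
# Officially maintained image: CUDA, cuDNN, GPU drivers and TF 2.4 included
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-4

# Copy the training code in and install any extra dependencies on top
COPY trainer/ /app/trainer/
RUN pip install --no-cache-dir -r /app/trainer/requirements.txt

WORKDIR /app
ENTRYPOINT ["python", "-m", "trainer.task"]
```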