Tensorflow multiple sessions with multiple GPUs


Question

I have a workstation with 2 GPUs and I am trying to run multiple TensorFlow jobs at the same time, so I can train more than one model at once, etc.

For example, I've tried to separate the sessions onto different resources via the Python API, with script1.py using:

with tf.device("/gpu:0"):
    # do stuff

in script2.py:

with tf.device("/gpu:1"):
    # do stuff

in script3.py:

with tf.device("/cpu:0"):
    # do stuff

If I run each script by itself I can see that it is using the specified device. (Also, each model fits comfortably on a single GPU and doesn't use the other one even if both are available.)

However, if one script is running and I try to run another, I always get this error:

I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 8
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:909] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties: 
name: GeForce GTX 980
major: 5 minor: 2 memoryClockRate (GHz) 1.2155
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 187.65MiB
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:909] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 1 with properties: 
name: GeForce GTX 980
major: 5 minor: 2 memoryClockRate (GHz) 1.2155
pciBusID 0000:04:00.0
Total memory: 4.00GiB
Free memory: 221.64MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0 1 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0:   Y Y 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 1:   Y Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating    TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 980, pci bus id: 0000:04:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Allocating 187.40MiB bytes.
E tensorflow/stream_executor/cuda/cuda_driver.cc:932] failed to allocate 187.40M (196505600 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
F tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Check failed: gpu_mem != nullptr  Could not allocate GPU device memory for device 0. Tried to allocate 187.40MiB
Aborted (core dumped)

It seems each TensorFlow process tries to grab all of the GPUs on the machine when it loads, even if not all devices will be used to run the model.

I see there is an option to limit the amount of GPU memory each process uses:

tf.GPUOptions(per_process_gpu_memory_fraction=0.5)

...I haven't tried it, but this seems like it would make two processes try to share 50% of each GPU's memory instead of running each process on a separate GPU...
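For reference, in the TF 1.x-era API used in this question, that option would be passed through a `ConfigProto` when the session is created. A minimal sketch (note the fraction applies to every GPU the process can see, which is exactly the asker's concern; `allow_growth` is a related option that allocates memory on demand instead):

```python
import tensorflow as tf

# Limit this process to ~50% of the memory on each *visible* GPU.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    # build and run the graph here
    pass
```

This caps memory per process but does not, by itself, keep a process off a particular GPU.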

Does anyone know how to configure tensorflow to use only one GPU and leave the other available for another tensorflow process?

Answer

TensorFlow will attempt to use (an equal fraction of the memory of) all GPU devices that are visible to it. If you want to run different sessions on different GPUs, you should do the following.

  1. Run each session in a different Python process.
  2. Start each process with a different value for the CUDA_VISIBLE_DEVICES environment variable. For example, if your script is called my_script.py and you have 4 GPUs, you could run the following:

$ CUDA_VISIBLE_DEVICES=0 python my_script.py  # Uses GPU 0.
$ CUDA_VISIBLE_DEVICES=1 python my_script.py  # Uses GPU 1.
$ CUDA_VISIBLE_DEVICES=2,3 python my_script.py  # Uses GPUs 2 and 3.

Note that the GPU devices in TensorFlow will still be numbered from zero (i.e. "/gpu:0" etc.), but they will correspond to the devices that you have made visible with CUDA_VISIBLE_DEVICES.
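The same effect can be had from inside the script itself, as long as the variable is set before TensorFlow is first imported, since device visibility is fixed when the CUDA runtime initializes. A sketch (the actual tensorflow import is left commented out):

```python
import os

# Must run before `import tensorflow`: CUDA reads this variable
# when it initializes during TensorFlow's import.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# import tensorflow as tf   # would now see physical GPU 1 as "/gpu:0"
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

This is convenient when the GPU choice is computed in Python (e.g. from a command-line argument) rather than hard-coded in a shell command.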
