Tensorflow on shared GPUs: how to automatically select the one that is unused


Problem description


I have access through ssh to a cluster of n GPUs. TensorFlow automatically names them gpu:0, ..., gpu:(n-1).


Others have access too and sometimes they take random GPUs. I did not place any tf.device() explicitly because that is cumbersome, and even if I selected GPU number j, someone might already be using GPU number j, which would be problematic.


I would like to go through the GPUs' usage, find the first one that is unused, and use only that one. I guess one could parse the output of nvidia-smi in bash, get a variable i, and feed that variable i to the TensorFlow script as the number of the GPU to use.


I have never seen any example of this. I imagine it is a pretty common problem. What would be the simplest way to do that? Is a pure-TensorFlow one available?

Recommended answer


I'm not aware of a pure-TensorFlow solution. The problem is that the existing place for TensorFlow configuration is the Session config. However, for GPU memory, a single memory pool is shared by all TensorFlow sessions within a process, so the Session config would be the wrong place to add it, and there is no mechanism for process-global config (though there should be, so that the process-global Eigen threadpool could also be configured). So you need to do it at the process level, using the CUDA_VISIBLE_DEVICES environment variable.
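To illustrate the mechanism: a process that sets CUDA_VISIBLE_DEVICES before any CUDA context is created sees only the listed physical GPUs, and they are renumbered from 0 inside the process. A minimal sketch (the helper name is illustrative, not part of any library):

```python
import os

def restrict_to_gpu(gpu_id):
    """Limit this process to a single physical GPU.

    Must be called before TensorFlow (or any CUDA library) is imported,
    because the variable is read when the CUDA context is created.
    Inside the process, the chosen GPU then appears as gpu:0.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

restrict_to_gpu(2)  # physical GPU 2 becomes this process's gpu:0
```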

Something like this:

import subprocess, re

# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23

def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldn't parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""

    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # |    0      8734    C   python                                       11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in gpu_output.split("\n"):
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns GPU with the least allocated memory"""

    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    best_memory, best_gpu = sorted(memory_gpu_map)[0]
    return best_gpu


You can then put this in utils.py and set the GPU in your TensorFlow script before the first tensorflow import, i.e.

import utils
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory())
import tensorflow
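As an aside, nvidia-smi also has a machine-readable query mode (`--query-gpu` with `--format=csv`), which avoids scraping the human-readable table and is less likely to break across driver versions. A sketch along the same lines, assuming that flag is available on your driver (the function names here are illustrative):

```python
import subprocess

def parse_gpu_usage(csv_text):
    """Parse 'index, memory.used' CSV lines into {gpu_id: used_mib}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = line.split(",")
        usage[int(index)] = int(used)
    return usage

def pick_gpu_via_query():
    """Pick the GPU with the least used memory via nvidia-smi's CSV query mode."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"]).decode()
    usage = parse_gpu_usage(out)
    return min(usage, key=usage.get)
```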

