Why does TensorFlow always use GPU 0?

Problem description

I ran into a problem when running TensorFlow inference on a multi-GPU machine.

Environment: Python 3.6.4; TensorFlow 1.8.0; CentOS 7.3; 2 × NVIDIA Tesla P4

Here is the nvidia-smi output when the system is idle:

Tue Aug 28 10:47:42 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   38C    P0    22W /  75W |      0MiB /  7606MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P4            Off  | 00000000:00:0D.0 Off |                    0 |
| N/A   39C    P0    23W /  75W |      0MiB /  7606MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The key statements related to my issue:

import os
import tensorflow as tf

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

def get_sess_and_tensor(ckpt_path):
    assert os.path.exists(ckpt_path), "file: {} not exist.".format(ckpt_path)
    graph = tf.Graph()
    with graph.as_default():
        od_graph_def = tf.GraphDef()
        with tf.gfile.GFile(ckpt_path, "rb") as fid1:
            od_graph_def.ParseFromString(fid1.read())
            tf.import_graph_def(od_graph_def, name="")
        sess = tf.Session(graph=graph)
    with tf.device('/gpu:1'):
        tensor = graph.get_tensor_by_name("image_tensor:0")
        boxes = graph.get_tensor_by_name("detection_boxes:0")
        scores = graph.get_tensor_by_name("detection_scores:0")
        classes = graph.get_tensor_by_name('detection_classes:0')

    return sess, tensor, boxes, scores, classes

The problem: with the visible devices set to '0,1', and even with tf.device set to GPU 1, nvidia-smi shows that only GPU 0 is used during inference (GPU 0's GPU-Util is high, almost 100%, whereas GPU 1's is 0%). Why isn't GPU 1 used?

I want to use the two GPUs in parallel, but even with the following code, it still uses only GPU 0:

with tf.device('/gpu:0'):
    tensor = graph.get_tensor_by_name("image_tensor:0")
    boxes = graph.get_tensor_by_name("detection_boxes:0")
with tf.device('/gpu:1'):
    scores = graph.get_tensor_by_name("detection_scores:0")
    classes = graph.get_tensor_by_name('detection_classes:0')

Any suggestions are greatly appreciated.

Thanks.

Wesley

Solution

You can use the GPUtil package to find an unused GPU and set the CUDA_VISIBLE_DEVICES environment variable so that only that GPU is visible to the process.

This lets you run parallel experiments across all of your GPUs, with each process using a different one.

# Import os to set the environment variable CUDA_VISIBLE_DEVICES
import os
import tensorflow as tf
import GPUtil

# Set CUDA_DEVICE_ORDER so the IDs assigned by CUDA match those from nvidia-smi
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Get the first available GPU
DEVICE_ID_LIST = GPUtil.getFirstAvailable()
DEVICE_ID = DEVICE_ID_LIST[0] # grab first element from list

# Set CUDA_VISIBLE_DEVICES to mask out all other GPUs than the first available device id
os.environ["CUDA_VISIBLE_DEVICES"] = str(DEVICE_ID)

# Since all other GPUs are masked out, the first available GPU will now be identified as GPU:0
device = '/gpu:0'
print('Device ID (unmasked): ' + str(DEVICE_ID))
print('Device ID (masked): ' + str(0))

# Run a minimum working example on the selected GPU
# Start a session
with tf.Session() as sess:
    # Select the device
    with tf.device(device):
        # Declare two numbers and add them together in TensorFlow
        a = tf.constant(12)
        b = tf.constant(30)
        result = sess.run(a+b)
        print('a+b=' + str(result))
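
The renumbering mentioned in the comments above can be illustrated without TensorFlow. The helper below is a hypothetical sketch (logical_device_map is not part of GPUtil or TensorFlow). It assumes CUDA_DEVICE_ORDER is set to PCI_BUS_ID so that physical IDs match nvidia-smi, that CUDA_VISIBLE_DEVICES contains plain integer indices, and that visible GPUs are numbered in the order they are listed.

```python
def logical_device_map(cuda_visible_devices):
    """Map logical device names to physical GPU IDs (illustration only).

    Assumes the value is a comma-separated list of integer indices and that
    the process numbers the visible GPUs in the order they are listed.
    """
    physical_ids = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    return {"/gpu:{}".format(i): pid for i, pid in enumerate(physical_ids)}


# With only physical GPU 1 visible, it is addressed as '/gpu:0'.
print(logical_device_map("1"))
# With both GPUs visible, '/gpu:1' refers to physical GPU 1.
print(logical_device_map("0,1"))
```

So after masking, the device string in the script should always be '/gpu:0', whichever physical GPU was selected.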

Reference: https://github.com/anderskm/gputil
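
To use both GPUs at the same time with this approach, one option is to start one process per GPU, each with its own CUDA_VISIBLE_DEVICES. The following is a minimal sketch using only the standard library. The names worker_env and launch_workers are made up for illustration, and the child command is a placeholder that you would replace with your own inference script.

```python
import os
import subprocess
import sys


def worker_env(gpu_id, base_env=None):
    """Return a copy of the environment that exposes only one physical GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # match nvidia-smi numbering
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # mask out every other GPU
    return env


def launch_workers(gpu_ids, command):
    """Start one copy of `command` per GPU and return the exit codes."""
    procs = [subprocess.Popen(command, env=worker_env(g)) for g in gpu_ids]
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Placeholder child: it only reports which GPU it was given.
    # Replace it with the real inference script.
    child = [sys.executable, "-c",
             "import os; print('visible GPU:', os.environ['CUDA_VISIBLE_DEVICES'])"]
    print(launch_workers([0, 1], child))
```

Inside each child process the single visible GPU is addressed as '/gpu:0', so the same script can run unchanged on every GPU.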
