How to force tensorflow to use all available GPUs?


Problem description

I have an 8-GPU cluster, and when I run a piece of Tensorflow code (pasted below), it only utilizes a single GPU instead of all 8. I confirmed this using nvidia-smi.

# Imports are not shown in the original post; the ones below are inferred
# from the calls in the snippet (standalone Keras, skimage, OpenCV, tqdm).
import os
import sys
import random
import warnings

import numpy as np
import cv2
import matplotlib.pyplot as plt
import tensorflow as tf
from tqdm import tqdm
from skimage.io import imread, imshow
from skimage.transform import resize

from keras import backend as K
from keras import optimizers
from keras.models import Model
from keras.layers import Input, Lambda, Conv2D, Conv2DTranspose, MaxPooling2D, concatenate
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Set some parameters
IMG_WIDTH = 256
IMG_HEIGHT = 256
IMG_CHANNELS = 3
TRAIN_IM = './train_im/'
TRAIN_MASK = './train_mask/'
TEST_PATH = './test/'

warnings.filterwarnings('ignore', category=UserWarning, module='skimage')
num_training = len(os.listdir(TRAIN_IM))
num_test = len(os.listdir(TEST_PATH))
# Get and resize train images
X_train = np.zeros((num_training, IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
Y_train = np.zeros((num_training, IMG_HEIGHT, IMG_WIDTH, 1), dtype=np.bool)
print('Getting and resizing train images and masks ... ')
sys.stdout.flush()

#load training images
for count, filename in tqdm(enumerate(os.listdir(TRAIN_IM)), total=num_training):
    img = imread(os.path.join(TRAIN_IM, filename))[:,:,:IMG_CHANNELS]
    img = resize(img, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True)
    X_train[count] = img
    name, ext = os.path.splitext(filename)
    mask_name = name + '_mask' + ext
    mask = cv2.imread(os.path.join(TRAIN_MASK, mask_name))[:,:,:1]
    mask = resize(mask, (IMG_HEIGHT, IMG_WIDTH))
    Y_train[count] = mask

# Check if training data looks all right
ix = random.randint(0, num_training-1)
print(ix)
imshow(X_train[ix])
plt.show()
imshow(np.squeeze(Y_train[ix]))
plt.show()
# Define IoU metric
def mean_iou(y_true, y_pred):
    prec = []
    for t in np.arange(0.5, 1.0, 0.05):
        y_pred_ = tf.to_int32(y_pred > t)
        score, up_opt = tf.metrics.mean_iou(y_true, y_pred_, 2)
        K.get_session().run(tf.local_variables_initializer())
        with tf.control_dependencies([up_opt]):
            score = tf.identity(score)
        prec.append(score)
    return K.mean(K.stack(prec), axis=0)

# Build U-Net model
inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = Lambda(lambda x: x / 255) (inputs)
width = 64
c1 = Conv2D(width, (3, 3), activation='relu', padding='same') (s)
c1 = Conv2D(width, (3, 3), activation='relu', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)

c2 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (p1)
c2 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)

c3 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (p2)
c3 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)

c4 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (p3)
c4 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)

c5 = Conv2D(width*16, (3, 3), activation='relu', padding='same') (p4)
c5 = Conv2D(width*16, (3, 3), activation='relu', padding='same') (c5)

u6 = Conv2DTranspose(width*8, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (u6)
c6 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (c6)

u7 = Conv2DTranspose(width*4, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (u7)
c7 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (c7)

u8 = Conv2DTranspose(width*2, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (u8)
c8 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (c8)

u9 = Conv2DTranspose(width, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(width, (3, 3), activation='relu', padding='same') (u9)
c9 = Conv2D(width, (3, 3), activation='relu', padding='same') (c9)

outputs = Conv2D(1, (1, 1), activation='sigmoid') (c9)

model = Model(inputs=[inputs], outputs=[outputs])

sgd = optimizers.SGD(lr=0.03, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='binary_crossentropy', metrics=[mean_iou])
model.summary()

# Fit model
earlystopper = EarlyStopping(patience=20, verbose=1)
checkpointer = ModelCheckpoint('nuclei_only.h5', verbose=1, save_best_only=True)
results = model.fit(X_train, Y_train, validation_split=0.05, batch_size = 32, verbose=1, epochs=100, 
                callbacks=[earlystopper, checkpointer])

I would like to use mxnet or some other method to run this code on all available GPUs. However, I'm not sure how to do this. All the resources I have found only show how to do this on the MNIST data set. I have my own data set that I am reading differently, so I'm not quite sure how to amend the code.

Recommended answer

TL;DR: Use tf.distribute.MirroredStrategy() as a scope, like this:

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    [...create model as you would otherwise...]

If you do not specify any arguments, tf.distribute.MirroredStrategy() will use all available GPUs. You can also specify which ones to use if you want, like this: mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"]).
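
Applied to the model in the question, that might look roughly like the sketch below. This is not the answer's original code: build_unet() is a hypothetical helper wrapping the Input/Conv2D/.../Model construction from the question, and X_train/Y_train are the arrays prepared there.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)

with strategy.scope():
    model = build_unet()  # hypothetical helper: the U-Net definition from the question
    sgd = tf.keras.optimizers.SGD(learning_rate=0.03, momentum=0.9, nesterov=True)
    # The question's mean_iou metric uses TF1-only APIs (tf.to_int32,
    # K.get_session), so it is left out of this TF2 sketch.
    model.compile(optimizer=sgd, loss='binary_crossentropy')

# fit() splits each (global) batch across the replicas, so the batch size
# is often scaled by the number of GPUs.
model.fit(X_train, Y_train, validation_split=0.05,
          batch_size=32 * strategy.num_replicas_in_sync, epochs=100)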

Refer to this Distributed training with TensorFlow guide for implementation details and other strategies.

Earlier answer (now outdated: deprecated and removed as of April 1, 2020): use multi_gpu_model() from Keras.

TS;WM:

TensorFlow 2.0 now has the tf.distribute module, "a library for running a computation across multiple devices". It builds on the concept of "distribution strategies". You can specify the distribution strategy and then use it as a scope. TensorFlow will split the input, parallelize the calculations, and join the outputs for you basically transparently. Backpropagation is also subject to this. Since all processing is now done behind the scenes, you might want to familiarize yourself with the available strategies and their parameters as they might affect the speed of your training a lot. For example, do you want variables to reside on the CPU? Then use tf.distribute.experimental.CentralStorageStrategy(). Refer to the Distributed training with TensorFlow guide for more info.
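
For instance, a hedged sketch of that CPU-variables variant, reusing the same hypothetical build_unet() helper as in the TL;DR example above:

# Sketch: variables are kept on the CPU, computation is replicated on the GPUs.
central = tf.distribute.experimental.CentralStorageStrategy()
with central.scope():
    model = build_unet()  # hypothetical helper containing the U-Net from the question
    model.compile(optimizer='sgd', loss='binary_crossentropy')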

Earlier answer (now outdated, kept here for reference):

From the TensorFlow guide:

If you have more than one GPU in your system, the GPU with the lowest ID will be selected by default.

If you want to use multiple GPUs, unfortunately you have to manually specify which tensors to put on each GPU, like this:

with tf.device('/device:GPU:2'):

More information is in the TensorFlow guide on using multiple GPUs.
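
For example, a minimal sketch of that manual placement in TF1-style graph mode (the shapes are arbitrary):

import tensorflow as tf

# Pin one matmul to GPU 0 and another to GPU 1; the final add is placed
# automatically (allow_soft_placement falls back if a device is missing).
with tf.device('/device:GPU:0'):
    a = tf.random_normal([1000, 1000])
    b = tf.matmul(a, a)
with tf.device('/device:GPU:1'):
    c = tf.random_normal([1000, 1000])
    d = tf.matmul(c, c)
total = b + d

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(total))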

In terms of how to distribute your network over multiple GPUs, there are two main approaches.

  1. You distribute your network layer-wise over the GPUs. This is easier to implement, but it will not yield much of a performance benefit because the GPUs end up waiting for each other to finish their operations.

  2. You create separate copies of your network, called "towers", one on each GPU. To feed the eight-tower network, you break each input batch into 8 parts and distribute them, let the towers propagate forward, sum the gradients, and do the backward propagation. This results in an almost linear speedup with the number of GPUs. It is much more difficult to implement, however, because you also have to deal with complexities related to batch normalization, and it is very advisable to make sure you randomize your batches properly. There is a nice tutorial here. You should also review the Inception V3 code referenced there for ideas on how to structure such a setup, especially _tower_loss(), _average_gradients(), and the part of train() starting with for i in range(FLAGS.num_gpus): (a rough sketch of the pattern follows this list).
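
A simplified sketch of the tower pattern in TF1-style graph mode; every name here (build_tower_loss(), x_splits, y_splits) is a hypothetical placeholder rather than code from the linked tutorial:

# Simplified data-parallel "towers" sketch (TF1 graph mode).
# Assumptions: build_tower_loss(x, y) is a hypothetical helper that builds
# one copy of the model and returns its loss; x_splits/y_splits are the
# input batch already split into NUM_GPUS pieces (e.g. with tf.split).
import tensorflow as tf

NUM_GPUS = 8
optimizer = tf.train.MomentumOptimizer(0.03, momentum=0.9, use_nesterov=True)

tower_grads = []
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
            loss = build_tower_loss(x_splits[i], y_splits[i])
            tower_grads.append(optimizer.compute_gradients(loss))

# Average the per-tower gradients variable by variable, then apply them once.
averaged = []
for grads_and_vars in zip(*tower_grads):
    grads = tf.stack([g for g, _ in grads_and_vars])
    averaged.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))
train_op = optimizer.apply_gradients(averaged)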

In case you want to give Keras a try, it now has simplified multi-gpu training significantly with multi_gpu_model(). It can do all the heavy lifting for you.
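
For reference, against that old (now removed) API it would have looked roughly like this, reusing the model, optimizer, metric, and callbacks defined in the question:

from keras.utils import multi_gpu_model

# Sketch with the deprecated API: replicate the model on 8 GPUs and let
# Keras split each batch across them. Checkpointing the parallel model saves
# the multi-GPU wrapper, so people often checkpoint the template `model` instead.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(optimizer=sgd, loss='binary_crossentropy', metrics=[mean_iou])
results = parallel_model.fit(X_train, Y_train, validation_split=0.05,
                             batch_size=32 * 8, epochs=100, verbose=1,
                             callbacks=[earlystopper, checkpointer])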
