How to solve "OOM when allocating tensor with shape[XXX]" in tensorflow (when training a GCN)


Problem description

So... I have checked a few posts on this issue (there are surely many more I haven't read, but I think it's reasonable to ask for help at this point), but I haven't found any solution that fits my situation.

This OOM error message always emerges (without a single exception) in the second round of a whatever-fold training loop, and also when re-running the training code after a first run. So this might be an issue related to this post: A previous stackoverflow question for OOM linked with tf.nn.embedding_lookup(), but I am not sure which function my issue lies in.
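As an aside on that second-round pattern: in graph-mode Keras, building a new model on every fold without resetting the backend session keeps the previous fold's graph (and its GPU allocations) alive. A minimal sketch of the common workaround, assuming the standalone keras package imported below:

from keras import backend as K

for test_indel in range(1, 11):
    K.clear_session()  # drop the previous fold's graph before building a new one
    # note: A_in is built outside the loop in the code below, so it would have
    # to be re-created here after the session is cleared
    # ... build and train the model as in the loop below ...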

My NN is a GCN with two graph convolutional layers, and I am running the code on a server with several 10 GB Nvidia P102-100 GPUs. I have set batch_size to 1, but nothing changed. I am also using Jupyter Notebook rather than running the python script from the command line, because on the command line I cannot even finish one round... Btw, does anyone know why some code can run without problems in Jupyter while popping OOM on the command line? It seems a bit strange to me.
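One environment difference worth ruling out (my assumption, not something confirmed in the question): whether the process lets the TF 1.x allocator claim GPU memory on demand rather than grabbing it all at once. A minimal sketch for the keras-on-TF1 setup shown in the traceback below:

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True    # allocate GPU memory on demand
K.set_session(tf.Session(config=config))  # make keras use this session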

UPDATE: After replacing Flatten() with GlobalMaxPool(), the error disappeared and I can run the code smoothly. However, if I further add one GC layer, the error comes in the first round. Thus, I guess the core issue is still there...
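For reference, a minimal sketch of that swap, assuming spektral's GlobalMaxPool layer. It also hints at why the OOM goes away: pooling collapses the (None, 13129, 32) node dimension to (None, 32), so the following Dense layer needs a 32x512 kernel instead of the 420128x512 one that shows up in the error message below.

from spektral.layers import GlobalMaxPool

# graph_conv is the output of the second GraphConv layer, as in the code below
pooled = GlobalMaxPool()(graph_conv)        # (None, 13129, 32) -> (None, 32)
fc = Dense(512, activation='relu')(pooled)  # kernel: 32x512 instead of 420128x512
output = Dense(n_out, activation='sigmoid')(fc)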

UPDATE2: Tried to replace tf.Tensor with tf.SparseTensor. Successful but of no use. Also tried to set up the mirrored strategy as mentioned in ML_Engine's answer, but it looks like one of the GPUs is occupied far more heavily than the others, and OOM still came out. Perhaps it's a kind of "data parallelism" and cannot solve my problem, since I have set batch_size to 1?
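A guess at what that SparseTensor variant looks like (the question doesn't show it; sp_matrix_to_sp_tensor is already imported from spektral in the code below):

t = sp_matrix_to_sp_tensor(fltr)  # keep the 13129x13129 filter as a tf.SparseTensor
A_in = Input(tensor=t)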

Code (adapted from GCNG):

from keras import Input, Model
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from keras.regularizers import l2
import tensorflow as tf
#from spektral.datasets import mnist
from spektral.layers import GraphConv
from spektral.layers.ops import sp_matrix_to_sp_tensor
from spektral.utils import normalized_laplacian
from keras.utils import plot_model
from sklearn import metrics
import numpy as np
import os  # needed for os.path.isdir / os.makedirs below
import gc

l2_reg = 5e-7  # Regularization rate for l2
learning_rate = 1e-6  # Learning rate for Adam
batch_size = 1  # Batch size
epochs = 1  # Number of training epochs
es_patience = 50  # Patience for early stopping

# DATA IMPORTING & PREPROCESSING OMITTED

# this part of adjacency matrix calculation is not important...
fltr = self_connection_normalized_adjacency(adj)
test = fltr.toarray()
t = tf.convert_to_tensor(test)
A_in = Input(tensor=t)
del fltr, test, t
gc.collect()


# Here comes the issue.

for test_indel in range(1,11):

    # SEVERAL LINES OMITTED (get X_train, y_train, X_val, y_val, X_test, y_test)
    
    # Build model
    N = X_train.shape[-2]  # Number of nodes in the graphs
    F = X_train.shape[-1]  # Node features dimensionality
    n_out = y_train.shape[-1]  # Dimension of the target
    X_in = Input(shape=(N, F))
    graph_conv = GraphConv(32,activation='elu',kernel_regularizer=l2(l2_reg),use_bias=True)([X_in, A_in])
    graph_conv = GraphConv(32,activation='elu',kernel_regularizer=l2(l2_reg),use_bias=True)([graph_conv, A_in])
    flatten = Flatten()(graph_conv)
    fc = Dense(512, activation='relu')(flatten)
    output = Dense(n_out, activation='sigmoid')(fc)
    model = Model(inputs=[X_in, A_in], outputs=output)
    optimizer = Adam(lr=learning_rate)
    model.compile(optimizer=optimizer,loss='binary_crossentropy',metrics=['acc'])
    model.summary()

    save_dir = current_path+'/'+str(test_indel)+'_self_connection_Ycv_LR_as_nega_rg_5-7_lr_1-6_e'+str(epochs)
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    early_stopping = EarlyStopping(monitor='val_acc', patience=es_patience, verbose=0, mode='auto')
    checkpoint1 = ModelCheckpoint(filepath=save_dir + '/weights.{epoch:02d}-{val_loss:.2f}.hdf5', monitor='val_loss',verbose=1, save_best_only=False, save_weights_only=False, mode='auto', period=1)
    checkpoint2 = ModelCheckpoint(filepath=save_dir + '/weights.hdf5', monitor='val_acc', verbose=1,save_best_only=True, mode='auto', period=1)
    callbacks = [checkpoint2, early_stopping]

    # Train model
    validation_data = (X_val, y_val)
    print('batch size = '+str(batch_size))
    history = model.fit(X_train,y_train,batch_size=batch_size,validation_data=validation_data,epochs=epochs,callbacks=callbacks)

    # Prediction and write-file code omitted
    del X_in, X_data_train,Y_data_train,gene_pair_index_train,count_setx_train,X_data_test, Y_data_test,gene_pair_index_test,trainX_index,validation_index,train_index, X_train, y_train, X_val, y_val, X_test, y_test, validation_data, graph_conv, flatten, fc, output, model, optimizer, history 
    gc.collect()

Model summary:

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_2 (InputLayer)            (None, 13129, 2)     0                                            
__________________________________________________________________________________________________
input_1 (InputLayer)            (13129, 13129)       0                                            
__________________________________________________________________________________________________
graph_conv_1 (GraphConv)        (None, 13129, 32)    96          input_2[0][0]                    
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
graph_conv_2 (GraphConv)        (None, 13129, 32)    1056        graph_conv_1[0][0]               
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
flatten_1 (Flatten)             (None, 420128)       0           graph_conv_2[0][0]               
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 512)          215106048   flatten_1[0][0]                  
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1)            513         dense_1[0][0]                    
==================================================================================================
Total params: 215,107,713
Trainable params: 215,107,713
Non-trainable params: 0
__________________________________________________________________________________________________
batch size = 1

Error message (please note that this message never comes during the first round after a Restart-and-Clear-Output). The shape[420128,512] tensor it fails to allocate matches the kernel of the Dense layer that follows Flatten (420128 x 512 ≈ 215M floats), and the failing node training_1/Adam/mul_23 is one of Adam's per-parameter slot updates for that kernel:

Train on 2953 samples, validate on 739 samples
Epoch 1/1
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-5-943385df49dc> in <module>()
     62     mem = psutil.virtual_memory()
     63     print("current mem " + str(round(mem.percent))+'%')
---> 64     history = model.fit(X_train,y_train,batch_size=batch_size,validation_data=validation_data,epochs=epochs,callbacks=callbacks)
     65     mem = psutil.virtual_memory()
     66     print("current mem " + str(round(mem.percent))+'%')

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
   1237                                         steps_per_epoch=steps_per_epoch,
   1238                                         validation_steps=validation_steps,
-> 1239                                         validation_freq=validation_freq)
   1240 
   1241     def evaluate(self,

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/keras/engine/training_arrays.py in fit_loop(model, fit_function, fit_inputs, out_labels, batch_size, epochs, verbose, callbacks, val_function, val_inputs, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq)
    194                     ins_batch[i] = ins_batch[i].toarray()
    195 
--> 196                 outs = fit_function(ins_batch)
    197                 outs = to_list(outs)
    198                 for l, o in zip(out_labels, outs):

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
   3290 
   3291     fetched = self._callable_fn(*array_vals,
-> 3292                                 run_metadata=self.run_metadata)
   3293     self._call_fetch_callbacks(fetched[-len(self._fetches):])
   3294     output_structure = nest.pack_sequence_as(

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1456         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1457                                                self._handle, args,
-> 1458                                                run_metadata_ptr)
   1459         if run_metadata:
   1460           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[420128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training_1/Adam/mul_23}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[metrics_1/acc/Identity/_323]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[420128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training_1/Adam/mul_23}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Recommended answer

You can make use of distributed strategies in tensorflow to make sure that your multi-GPU setup is being used appropriately:

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    for test_indel in range(1,11):
         <etc>

See here.

Mirrored strategy is used for synchronous distributed training across multiple GPUs on a single server, which sounds like the setup you're using. There's also a more intuitive explanation in this blog.
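One caveat worth spelling out (my addition, not part of the original answer): tf.distribute strategies work with tf.keras rather than the standalone keras package imported in the question, and variable creation, i.e. building and compiling the model, must happen inside the scope; fit can stay outside. A sketch, with build_model as a hypothetical helper wrapping the layer stack from the question:

mirrored_strategy = tf.distribute.MirroredStrategy()

for test_indel in range(1, 11):
    with mirrored_strategy.scope():
        model = build_model()  # hypothetical helper: the Input/GraphConv/Dense stack above
        model.compile(optimizer=Adam(lr=learning_rate),
                      loss='binary_crossentropy', metrics=['acc'])
    history = model.fit(X_train, y_train, batch_size=batch_size,
                        validation_data=(X_val, y_val), epochs=epochs)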

Also, you could try making use of mixed precision, which should free up memory significantly by altering the float type of the parameters in the model.
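A minimal sketch of that, assuming TF 2.4+ with tf.keras (the mixed-precision API is different in the TF 1.x setup shown in the traceback):

from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')  # float16 compute, float32 variables

# build the model as before, but keep the final activation in float32
# for numerical stability:
output = Dense(n_out, activation='sigmoid', dtype='float32')(fc)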
