How to solve "OOM when allocating tensor with shape[XXX]" in tensorflow (when training a GCN)


Problem description

So... I have checked a few posts on this issue (there are surely many more I haven't read, but I think it's reasonable to ask for help at this point), but I haven't found any solution that fits my situation.

This OOM error message always emerges (without a single exception) in the second round of a whatever-fold training loop, and also when re-running the training code after a first run. So this might be an issue related to this post: A previous stackoverflow question for OOM linked with tf.nn.embedding_lookup(), but I am not sure which function my issue lies in.
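As an aside on that second-round pattern: in graph-mode Keras, building a new model on every fold without resetting the backend session keeps the previous fold's graph (and its GPU allocations) alive. A minimal sketch of the common workaround, assuming the standalone keras package imported below:

from keras import backend as K

for test_indel in range(1, 11):
    K.clear_session()  # drop the previous fold's graph before building a new one
    # note: A_in is built outside the loop in the code below, so it would have
    # to be re-created here after the session is cleared
    # ... build and train the model as in the loop below ...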

My NN is a GCN with two graph convolutional layers, and I am running the code on a server with several 10 GB Nvidia P102-100 GPUs. I have set batch_size to 1, but nothing changed. I am also using Jupyter Notebook rather than running the python script from the command line, because on the command line I cannot even finish one round... Btw, does anyone know why some code can run without problems in Jupyter while popping OOM on the command line? It seems a bit strange to me.
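One environment difference worth ruling out (my assumption, not something confirmed in the question): whether the process lets the TF 1.x allocator claim GPU memory on demand rather than grabbing it all at once. A minimal sketch for the keras-on-TF1 setup shown in the traceback below:

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True    # allocate GPU memory on demand
K.set_session(tf.Session(config=config))  # make keras use this session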

UPDATE: After replacing Flatten() with GlobalMaxPool(), the error disappeared and I can run the code smoothly. However, if I further add one GC layer, the error comes in the first round. Thus, I guess the core issue is still there...
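For reference, a minimal sketch of that swap, assuming spektral's GlobalMaxPool layer. It also hints at why the OOM goes away: pooling collapses the (None, 13129, 32) node dimension to (None, 32), so the following Dense layer needs a 32x512 kernel instead of the 420128x512 one that shows up in the error message below.

from spektral.layers import GlobalMaxPool

# graph_conv is the output of the second GraphConv layer, as in the code below
pooled = GlobalMaxPool()(graph_conv)        # (None, 13129, 32) -> (None, 32)
fc = Dense(512, activation='relu')(pooled)  # kernel: 32x512 instead of 420128x512
output = Dense(n_out, activation='sigmoid')(fc)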

UPDATE2: Tried to replace tf.Tensor with tf.SparseTensor. Successful but of no use. Also tried to set up the mirrored strategy as mentioned in ML_Engine's answer, but it looks like one of the GPUs is occupied far more heavily than the others, and OOM still came out. Perhaps it's a kind of "data parallelism" and cannot solve my problem, since I have set batch_size to 1?
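A guess at what that SparseTensor variant looks like (the question doesn't show it; sp_matrix_to_sp_tensor is already imported from spektral in the code below):

t = sp_matrix_to_sp_tensor(fltr)  # keep the 13129x13129 filter as a tf.SparseTensor
A_in = Input(tensor=t)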

Code (adapted from GCNG):

from keras import Input, Model
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from keras.regularizers import l2
import tensorflow as tf
#from spektral.datasets import mnist
from spektral.layers import GraphConv
from spektral.layers.ops import sp_matrix_to_sp_tensor
from spektral.utils import normalized_laplacian
from keras.utils import plot_model
from sklearn import metrics
import numpy as np
import os  # needed for os.path.isdir / os.makedirs below
import gc

l2_reg = 5e-7  # Regularization rate for l2
learning_rate = 1e-6  # Learning rate for Adam
batch_size = 1  # Batch size
epochs = 1  # Number of training epochs
es_patience = 50  # Patience for early stopping

# DATA IMPORTING & PREPROCESSING OMITTED

# this part of adjacency matrix calculation is not important...
fltr = self_connection_normalized_adjacency(adj)
test = fltr.toarray()
t = tf.convert_to_tensor(test)
A_in = Input(tensor=t)
del fltr, test, t
gc.collect()


# Here comes the issue.

for test_indel in range(1,11):

    # SEVERAL LINES OMITTED (get X_train, y_train, X_val, y_val, X_test, y_test)
    
    # Build model
    N = X_train.shape[-2]  # Number of nodes in the graphs
    F = X_train.shape[-1]  # Node features dimensionality
    n_out = y_train.shape[-1]  # Dimension of the target
    X_in = Input(shape=(N, F))
    graph_conv = GraphConv(32,activation='elu',kernel_regularizer=l2(l2_reg),use_bias=True)([X_in, A_in])
    graph_conv = GraphConv(32,activation='elu',kernel_regularizer=l2(l2_reg),use_bias=True)([graph_conv, A_in])
    flatten = Flatten()(graph_conv)
    fc = Dense(512, activation='relu')(flatten)
    output = Dense(n_out, activation='sigmoid')(fc)
    model = Model(inputs=[X_in, A_in], outputs=output)
    optimizer = Adam(lr=learning_rate)
    model.compile(optimizer=optimizer,loss='binary_crossentropy',metrics=['acc'])
    model.summary()

    save_dir = current_path+'/'+str(test_indel)+'_self_connection_Ycv_LR_as_nega_rg_5-7_lr_1-6_e'+str(epochs)
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    early_stopping = EarlyStopping(monitor='val_acc', patience=es_patience, verbose=0, mode='auto')
    checkpoint1 = ModelCheckpoint(filepath=save_dir + '/weights.{epoch:02d}-{val_loss:.2f}.hdf5', monitor='val_loss',verbose=1, save_best_only=False, save_weights_only=False, mode='auto', period=1)
    checkpoint2 = ModelCheckpoint(filepath=save_dir + '/weights.hdf5', monitor='val_acc', verbose=1,save_best_only=True, mode='auto', period=1)
    callbacks = [checkpoint2, early_stopping]

    # Train model
    validation_data = (X_val, y_val)
    print('batch size = '+str(batch_size))
    history = model.fit(X_train,y_train,batch_size=batch_size,validation_data=validation_data,epochs=epochs,callbacks=callbacks)

    # Prediction and write-file code omitted
    del X_in, X_data_train,Y_data_train,gene_pair_index_train,count_setx_train,X_data_test, Y_data_test,gene_pair_index_test,trainX_index,validation_index,train_index, X_train, y_train, X_val, y_val, X_test, y_test, validation_data, graph_conv, flatten, fc, output, model, optimizer, history 
    gc.collect()

Model summary:

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_2 (InputLayer)            (None, 13129, 2)     0                                            
__________________________________________________________________________________________________
input_1 (InputLayer)            (13129, 13129)       0                                            
__________________________________________________________________________________________________
graph_conv_1 (GraphConv)        (None, 13129, 32)    96          input_2[0][0]                    
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
graph_conv_2 (GraphConv)        (None, 13129, 32)    1056        graph_conv_1[0][0]               
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
flatten_1 (Flatten)             (None, 420128)       0           graph_conv_2[0][0]               
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 512)          215106048   flatten_1[0][0]                  
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1)            513         dense_1[0][0]                    
==================================================================================================
Total params: 215,107,713
Trainable params: 215,107,713
Non-trainable params: 0
__________________________________________________________________________________________________
batch size = 1

Error message (please note that this message never comes during the first round after a Restart-and-Clear-Output). The shape[420128,512] tensor it fails to allocate matches the kernel of the Dense layer that follows Flatten (420128 x 512 ≈ 215M floats), and the failing node training_1/Adam/mul_23 is one of Adam's per-parameter slot updates for that kernel:

Train on 2953 samples, validate on 739 samples
Epoch 1/1
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-5-943385df49dc> in <module>()
     62     mem = psutil.virtual_memory()
     63     print("current mem " + str(round(mem.percent))+'%')
---> 64     history = model.fit(X_train,y_train,batch_size=batch_size,validation_data=validation_data,epochs=epochs,callbacks=callbacks)
     65     mem = psutil.virtual_memory()
     66     print("current mem " + str(round(mem.percent))+'%')

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
   1237                                         steps_per_epoch=steps_per_epoch,
   1238                                         validation_steps=validation_steps,
-> 1239                                         validation_freq=validation_freq)
   1240 
   1241     def evaluate(self,

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/keras/engine/training_arrays.py in fit_loop(model, fit_function, fit_inputs, out_labels, batch_size, epochs, verbose, callbacks, val_function, val_inputs, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq)
    194                     ins_batch[i] = ins_batch[i].toarray()
    195 
--> 196                 outs = fit_function(ins_batch)
    197                 outs = to_list(outs)
    198                 for l, o in zip(out_labels, outs):

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
   3290 
   3291     fetched = self._callable_fn(*array_vals,
-> 3292                                 run_metadata=self.run_metadata)
   3293     self._call_fetch_callbacks(fetched[-len(self._fetches):])
   3294     output_structure = nest.pack_sequence_as(

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1456         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1457                                                self._handle, args,
-> 1458                                                run_metadata_ptr)
   1459         if run_metadata:
   1460           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[420128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training_1/Adam/mul_23}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[metrics_1/acc/Identity/_323]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[420128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training_1/Adam/mul_23}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Recommended answer

You can make use of distributed strategies in tensorflow to make sure that your multi-GPU setup is being used appropriately:

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    for test_indel in range(1,11):
         <etc>

See here.

Mirrored strategy is used for synchronous distributed training across multiple GPUs on a single server, which sounds like the setup you're using. There's also a more intuitive explanation in this blog.
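One caveat worth spelling out (my addition, not part of the original answer): tf.distribute strategies work with tf.keras rather than the standalone keras package imported in the question, and variable creation, i.e. building and compiling the model, must happen inside the scope; fit can stay outside. A sketch, with build_model as a hypothetical helper wrapping the layer stack from the question:

mirrored_strategy = tf.distribute.MirroredStrategy()

for test_indel in range(1, 11):
    with mirrored_strategy.scope():
        model = build_model()  # hypothetical helper: the Input/GraphConv/Dense stack above
        model.compile(optimizer=Adam(lr=learning_rate),
                      loss='binary_crossentropy', metrics=['acc'])
    history = model.fit(X_train, y_train, batch_size=batch_size,
                        validation_data=(X_val, y_val), epochs=epochs)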

Also, you could try making use of mixed precision, which should free up memory significantly by altering the float type of the parameters in the model.
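A minimal sketch of that, assuming TF 2.4+ with tf.keras (the mixed-precision API is different in the TF 1.x setup shown in the traceback):

from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')  # float16 compute, float32 variables

# build the model as before, but keep the final activation in float32
# for numerical stability:
output = Dense(n_out, activation='sigmoid', dtype='float32')(fc)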
