Training broke with ResourceExhausted error


Problem description

I am new to TensorFlow and machine learning. Recently I have been working on a model. My model looks like the one below:

  1. Character-level embedding vector -> embedding lookup -> LSTM1

  2. Word-level embedding vector -> embedding lookup -> LSTM2

  3. [LSTM1 + LSTM2] -> single-layer MLP -> softmax layer

  4. [LSTM1 + LSTM2] -> single-layer MLP -> WGAN discriminator

Code of the model
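The full code is not reproduced here. Purely as an illustration (this is my sketch, not the asker's actual code; every name and size other than the stated 100/300 LSTM units is an assumption, and the reshaping of the character branch back to word level is omitted), the architecture described above could be wired up in TF 1.x roughly like this:

    import tensorflow as tf  # TF 1.x, matching the sess.run/feed_dict style used later

    # Hypothetical vocabulary/embedding sizes; only LSTM1 = 100 and LSTM2 = 300 units
    # come from the question (the bidirectional wrappers double them to 200 and 600).
    char_vocab, word_vocab, char_dim, word_dim, n_classes = 100, 20000, 50, 300, 10

    char_ids = tf.placeholder(tf.int32, [None, None], name="char_ids")
    word_ids = tf.placeholder(tf.int32, [None, None], name="word_ids")

    def bi_lstm(inputs, units, scope):
        """Bidirectional LSTM; returns the concatenated final hidden states."""
        with tf.variable_scope(scope):
            _, (fw, bw) = tf.nn.bidirectional_dynamic_rnn(
                tf.nn.rnn_cell.LSTMCell(units), tf.nn.rnn_cell.LSTMCell(units),
                inputs, dtype=tf.float32)
        return tf.concat([fw.h, bw.h], axis=-1)

    # 1-2. character- and word-level embedding lookups feeding LSTM1 / LSTM2
    char_emb = tf.nn.embedding_lookup(
        tf.get_variable("char_emb", [char_vocab, char_dim]), char_ids)
    word_emb = tf.nn.embedding_lookup(
        tf.get_variable("word_emb", [word_vocab, word_dim]), word_ids)
    lstm1 = bi_lstm(char_emb, 100, "chars")   # 200-dim after concatenation
    lstm2 = bi_lstm(word_emb, 300, "words")   # 600-dim after concatenation

    # 3-4. shared representation -> single-layer MLP heads for softmax and the WGAN critic
    features = tf.concat([lstm1, lstm2], axis=-1)
    logits = tf.layers.dense(tf.layers.dense(features, 256, tf.nn.relu), n_classes)
    critic = tf.layers.dense(tf.layers.dense(features, 256, tf.nn.relu), 1)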

While working on this model I got the following error. I thought my batch size was too big, so I tried to reduce it from 20 to 10, but that did not help.

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[24760,100]
  [[Node: chars/bidirectional_rnn/bw/bw/while/bw/lstm_cell/split = Split[T=DT_FLOAT, num_split=4, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients_2/Add_3/y, chars/bidirectional_rnn/bw/bw/while/bw/lstm_cell/BiasAdd)]]
  [[Node: bi-lstm/bidirectional_rnn/bw/bw/stack/_167 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_636_bi-lstm/bidirectional_rnn/bw/bw/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

A tensor with shape [24760, 100] means 2,476,000 × 32 / 8 / (1024 × 1024) ≈ 9.44 MB of memory. I am running the code on a Titan X (11 GB) GPU. What could go wrong? Why does this type of error occur?
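The same arithmetic, spelled out (assuming 32-bit floats, i.e. 4 bytes per element):

    # float32 tensor with shape [24760, 100]
    elements = 24760 * 100              # 2,476,000 values
    bytes_total = elements * 4          # 32 bits = 4 bytes per value
    print(bytes_total / (1024 ** 2))    # ~9.44 MB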

*Extra info*: the size of LSTM1 is 100; for the bidirectional LSTM it becomes 200. The size of LSTM2 is 300; for the bidirectional LSTM it becomes 600.

*Note*: the error occurred after 32 epochs. My question is why the error appears after 32 epochs and not at the initial epochs.

Recommended answer

I have been tweaking a lot these days to solve this problem.

Finally, I haven't solved the mystery of the memory size described in the question. My guess is that while computing the gradients TensorFlow accumulates a lot of additional memory. I would need to check the TensorFlow source to confirm, which seems very cumbersome at this time. You can check how much memory your model is using from the terminal with the following command:

nvidia-smi

Judging from this command's output, you can estimate how much additional memory you have left to use.
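As an aside (not part of the original answer): by default TF 1.x grabs almost all GPU memory up front, so nvidia-smi may show near-full usage regardless of the model. If you want the reported number to reflect what the model actually allocates, you can enable memory growth:

    import tensorflow as tf

    # Allocate GPU memory on demand instead of grabbing it all at start-up,
    # so that nvidia-smi reflects the model's real footprint.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)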

But the solution to this type of problem lies in reducing the batch size.

In my case, reducing the batch size to 3 worked. This may vary from model to model.

But what if you are using a model whose embedding matrix is so big that you cannot load it into memory?

The solution is to write some painful code.

You have to do the lookup on the embedding matrix outside the model and then load the looked-up embeddings into the model. In short, for each batch you have to give the lookup matrices to the model (feed them through the feed_dict argument of sess.run()).
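A minimal sketch of this idea (TF 1.x style as in the answer; the names, sizes and the toy model are my assumptions):

    import numpy as np
    import tensorflow as tf

    emb_dim, vocab_size, time_steps, batch = 100, 50000, 20, 10

    # Hypothetical embedding matrix kept outside the graph (e.g. memory-mapped from disk).
    pretrained = np.random.randn(vocab_size, emb_dim).astype(np.float32)

    # The model consumes already-looked-up vectors instead of holding the matrix itself.
    emb_input = tf.placeholder(tf.float32, [None, time_steps, emb_dim], name="emb_input")
    labels = tf.placeholder(tf.int32, [None], name="labels")

    cell = tf.nn.rnn_cell.LSTMCell(64)
    _, state = tf.nn.dynamic_rnn(cell, emb_input, dtype=tf.float32)
    logits = tf.layers.dense(state.h, 2)
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        batch_ids = np.random.randint(0, vocab_size, size=(batch, time_steps))
        # The lookup happens in NumPy, outside the graph; only the result is fed in.
        sess.run(train_op, feed_dict={emb_input: pretrained[batch_ids],
                                      labels: np.random.randint(0, 2, size=batch)})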

Next, you will face a new problem:

You cannot make the embeddings trainable this way. The solution is to feed the embeddings through a placeholder and assign them to a Variable (say, A). During each batch of training, the learning algorithm updates the variable A. Then read the updated values of A back out with TensorFlow and assign them to your embedding matrix, which lives outside the model. (I told you the process is painful.)
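A sketch of that workaround as I understand it (again TF 1.x; the names, the per-batch vocabulary size and the re-indexing of ids into that range are assumptions):

    import numpy as np
    import tensorflow as tf

    emb_dim, vocab_size, batch_vocab = 100, 500000, 3000
    # The full embedding matrix lives outside the model, e.g. in host memory or on disk.
    big_embeddings = np.random.randn(vocab_size, emb_dim).astype(np.float32)

    # Only the rows needed for the current batch are held in a trainable Variable A.
    emb_feed = tf.placeholder(tf.float32, [batch_vocab, emb_dim], name="emb_feed")
    A = tf.Variable(tf.zeros([batch_vocab, emb_dim]), name="A")
    load_A = tf.assign(A, emb_feed)                      # load this batch's rows into A

    ids = tf.placeholder(tf.int32, [None, None])         # ids re-indexed into [0, batch_vocab)
    inputs = tf.nn.embedding_lookup(A, ids)              # trainable, because A is a Variable
    # ... build the rest of the model and a train_op on top of `inputs` ...

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        rows = np.random.choice(vocab_size, size=batch_vocab, replace=False)
        sess.run(load_A, feed_dict={emb_feed: big_embeddings[rows]})
        # ... sess.run(train_op, ...) updates A together with the other weights ...
        big_embeddings[rows] = sess.run(A)               # write the learned rows back outside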

Now your next question should be: what if even the embedding lookup for a single batch is too big to feed to the model? This is a fundamental problem that you cannot avoid. That is why the NVIDIA GTX 1080, 1080 Ti and NVIDIA TITAN Xp differ so much in price, even though the 1080 Ti and 1080 run at higher clock frequencies.
