Training broke with ResourceExhausted error

This article discusses how to deal with training breaking due to a ResourceExhausted error; hopefully the answer below is a useful reference for anyone who runs into the same problem.

Problem Description

I am new to TensorFlow and machine learning. Recently I have been working on a model. My model looks like this:

  1. Character-level embedding vector -> embedding lookup -> LSTM1

  2. Word-level embedding vector -> embedding lookup -> LSTM2

  3. [LSTM1 + LSTM2] -> single-layer MLP -> softmax layer

  4. [LSTM1 + LSTM2] -> single-layer MLP -> WGAN discriminator

Code of the RNN model
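
Since the full code is not included above, here, purely for orientation, is a rough TF 1.x-style sketch of what such an architecture could look like; every size, scope name, and output head below is an assumption for illustration, not the poster's actual code:

    import tensorflow as tf

    char_emb = tf.placeholder(tf.float32, [None, None, 50])    # char-level embedding vectors
    word_emb = tf.placeholder(tf.float32, [None, None, 300])   # word-level embedding vectors

    with tf.variable_scope("chars"):                           # LSTM1, size 100 (200 bidirectional)
        fw, bw = tf.nn.rnn_cell.LSTMCell(100), tf.nn.rnn_cell.LSTMCell(100)
        (c_fw, c_bw), _ = tf.nn.bidirectional_dynamic_rnn(fw, bw, char_emb, dtype=tf.float32)
        lstm1 = tf.concat([c_fw, c_bw], axis=-1)

    with tf.variable_scope("bi-lstm"):                         # LSTM2, size 300 (600 bidirectional)
        fw, bw = tf.nn.rnn_cell.LSTMCell(300), tf.nn.rnn_cell.LSTMCell(300)
        (w_fw, w_bw), _ = tf.nn.bidirectional_dynamic_rnn(fw, bw, word_emb, dtype=tf.float32)
        lstm2 = tf.concat([w_fw, w_bw], axis=-1)

    # Assumes the two sequences are aligned token by token.
    features = tf.concat([lstm1, lstm2], axis=-1)               # [LSTM1 + LSTM2]
    hidden = tf.layers.dense(features, 200, activation=tf.nn.relu)  # single-layer MLP
    logits = tf.layers.dense(hidden, 10)                        # -> softmax layer (10 classes assumed)
    critic = tf.layers.dense(hidden, 1)                         # -> WGAN discriminator head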

While I was working on this model I got the following error. I thought my batch was too big, so I tried reducing the batch size from 20 to 10, but it didn't help.

    ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[24760,100]
    [[Node: chars/bidirectional_rnn/bw/bw/while/bw/lstm_cell/split = Split[T=DT_FLOAT, num_split=4, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients_2/Add_3/y, chars/bidirectional_rnn/bw/bw/while/bw/lstm_cell/BiasAdd)]]
    [[Node: bi-lstm/bidirectional_rnn/bw/bw/stack/_167 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_636_bi-lstm/bidirectional_rnn/bw/bw/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

A tensor with shape [24760, 100] means 24760 * 100 * 32 bits / 8 / (1024 * 1024) ≈ 9.445 MB of memory. I am running the code on a Titan X (11 GB) GPU. What could be going wrong? Why does this type of error occur?
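
As a quick sanity check of that arithmetic (assuming float32, i.e. 4 bytes per element):

    elements = 24760 * 100               # tensor shape [24760, 100]
    mib = elements * 4 / (1024 * 1024)   # float32 = 4 bytes per element
    print(mib)                           # -> 9.4451904296875 MiB, tiny next to 11 GB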

*Extra info*: the size of LSTM1 is 100; for the bidirectional LSTM it becomes 200. The size of LSTM2 is 300; for the bidirectional LSTM it becomes 600.

*Note*: the error occurred after 32 epochs. My question is why the error shows up after 32 epochs rather than in the initial epochs.

Recommended Answer

I have been tweaking a lot these days to solve this problem.

Finally, I haven't solved the mystery of the memory size described in the question. My guess is that while computing the gradients TensorFlow accumulates a lot of additional memory. I would need to check the TensorFlow source to confirm, which seems very cumbersome at this point. You can check how much memory your model is using from the terminal with the following command:

    nvidia-smi

Judging from the output of this command you can estimate how much additional memory you have available.

But the solution to this type of problem lies in reducing the batch size.

In my case, reducing the batch size to 3 worked. This may vary from model to model.

But what if you are using a model whose embedding matrices are so big that you cannot load them into GPU memory at all?

The solution is to write some painful code.

You have to do the lookup on the embedding matrix outside the model and then load only the looked-up embeddings into the model. In short, for each batch you have to give the looked-up matrices to the model (feed them through the feed_dict argument of sess.run()).
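
To illustrate the idea, here is a minimal TF 1.x-style sketch; every name and size in it (full_embeddings, emb_dim, the toy batch of ids) is a hypothetical stand-in, not the poster's actual code:

    import numpy as np
    import tensorflow as tf

    emb_dim = 100
    # The full embedding matrix stays in host (CPU) memory as a numpy array.
    full_embeddings = np.random.randn(50000, emb_dim).astype(np.float32)

    # The graph consumes already-looked-up vectors instead of doing the lookup itself.
    batch_emb = tf.placeholder(tf.float32, [None, None, emb_dim], name="batch_emb")
    output = tf.reduce_mean(batch_emb)                # stand-in for the rest of the model

    with tf.Session() as sess:
        word_ids = np.array([[12, 7, 104], [3, 55, 0]])      # one toy batch of word ids
        feed = {batch_emb: full_embeddings[word_ids]}         # lookup done in numpy, on the CPU
        print(sess.run(output, feed_dict=feed))

The GPU only ever sees the small per-batch slice; but, as described next, nothing in this setup updates full_embeddings, so the embeddings stay frozen.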

Next you will face a new problem:

You cannot make the embeddings trainable this way. The solution is to feed the embeddings through a placeholder and assign them to a Variable (say A). After each batch of training, the learning algorithm updates the variable A. Then you read the updated vectors of A back out of TensorFlow and assign them to your embedding matrix, which lives outside the model. (I said the process is painful.)
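
A minimal sketch of that workaround, again TF 1.x style and with every name (A, load_A, ids, the toy loss) being a hypothetical stand-in rather than the poster's code:

    import numpy as np
    import tensorflow as tf

    emb_dim, rows_per_batch = 100, 64
    full_embeddings = np.random.randn(50000, emb_dim).astype(np.float32)

    emb_in = tf.placeholder(tf.float32, [rows_per_batch, emb_dim])
    A = tf.Variable(tf.zeros([rows_per_batch, emb_dim]), trainable=True)
    load_A = tf.assign(A, emb_in)                        # copy the batch's rows into A

    loss = tf.reduce_sum(tf.square(A))                   # stand-in for the real model loss
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss, var_list=[A])

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        ids = np.random.choice(len(full_embeddings), rows_per_batch, replace=False)
        sess.run(load_A, {emb_in: full_embeddings[ids]})  # 1) load the current vectors into A
        sess.run(train_op)                                # 2) the optimizer updates A on the GPU
        full_embeddings[ids] = sess.run(A)                # 3) write the updated rows back out

Step 3 is the part that makes this painful: the bookkeeping of which rows belong to which batch lives entirely in your own Python code.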

Now your next question should be: what if you cannot feed even the embedding lookup for a single batch to the model because it is still too big? That is a fundamental problem you cannot avoid, and it is why the NVIDIA GTX 1080, 1080 Ti, and NVIDIA TITAN Xp differ so much in price even though the 1080 Ti and 1080 run at higher clock frequencies.

That concludes this article on training breaking with a ResourceExhausted error. We hope the recommended answer is helpful, and thank you for supporting IT屋!
