Understanding the ResourceExhaustedError: OOM when allocating tensor with shape


Problem description

I'm trying to implement a skip-thought model using TensorFlow; the current version is placed here.

Currently I'm using one GPU of my machine (2 GPUs in total), and the GPU info is:

2017-09-06 11:29:32.657299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:02:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB

However, I got an OOM error when I tried to feed data to the model. I tried to debug it as follows:

I ran sess.run(tf.global_variables_initializer()), then logged the total number of parameters with:

    logger.info('Total: {} params'.format(
        np.sum([
            np.prod(v.get_shape().as_list())
            for v in tf.trainable_variables()
        ])))

and got 2017-09-06 11:29:51,333 INFO main main.py:127 - Total: 62968629 params, roughly 240MB if everything uses tf.float32. The output of tf.global_variables is:

[<tf.Variable 'embedding/embedding_matrix:0' shape=(155229, 200) dtype=float32_ref>,
 <tf.Variable 'encoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>,
 <tf.Variable 'encoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'encoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>,
 <tf.Variable 'encoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>,
 <tf.Variable 'decoder/weights:0' shape=(200, 155229) dtype=float32_ref>,
 <tf.Variable 'decoder/biases:0' shape=(155229,) dtype=float32_ref>,
 <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>,
 <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>,
 <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>,
 <tf.Variable 'decoder/next_decoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>,
 <tf.Variable 'decoder/next_decoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'decoder/next_decoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>,
 <tf.Variable 'decoder/next_decoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>,
 <tf.Variable 'global_step:0' shape=() dtype=int32_ref>]
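
For reference, a quick back-of-the-envelope check over the shapes listed above (a sketch added for clarity, assuming all trainable parameters are float32) reproduces the reported count:

# Parameter count reconstructed from the variable shapes above.
embedding = 155229 * 200                       # embedding/embedding_matrix
gru_cell = 400 * 400 + 400 + 400 * 200 + 200   # gates + candidate of one GRU cell
decoder_proj = 200 * 155229 + 155229           # decoder/weights + decoder/biases

total = embedding + 3 * gru_cell + decoder_proj  # encoder + previous/next decoders
print(total)                   # 62968629
print(total * 4 / 1024 ** 2)   # ~240 MB in float32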

In my training phase, I have a data array whose shape is (164652, 3, 30), namely sample_size x 3 x time_step; the 3 here means the previous sentence, the current sentence, and the next sentence. This training data is about 57MB and is stored in a loader. I then use a generator function to get the sentences, which looks like:

def iter_batches(self, batch_size=128, time_major=True, shuffle=True):

    num_samples = len(self._sentences)
    if shuffle:
        samples = self._sentences[np.random.permutation(num_samples)]
    else:
        samples = self._sentences

    batch_start = 0
    while batch_start < num_samples:
        batch = samples[batch_start:batch_start + batch_size]

        lens = (batch != self._vocab[self._vocab.pad_token]).sum(axis=2)
        y, x, z = batch[:, 0, :], batch[:, 1, :], batch[:, 2, :]
        if time_major:
            yield (y.T, lens[:, 0]), (x.T, lens[:, 1]), (z.T, lens[:, 2])
        else:
            yield (y, lens[:, 0]), (x, lens[:, 1]), (z, lens[:, 2])
        batch_start += batch_size
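
For clarity, a minimal usage sketch of this generator (shapes inferred from the (164652, 3, 30) array described above):

# One batch from the generator, with the same arguments as in the training script.
batches = loader.iter_batches(batch_size=128, time_major=True)
(y, y_lens), (x, x_lens), (z, z_lens) = next(batches)
# y, x, z each have shape (30, 128), i.e. time_step x batch_size;
# y_lens, x_lens, z_lens each have shape (128,).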

The training loop looks like:

for epoch in range(num_epochs):
    batches = loader.iter_batches(batch_size=args.batch_size)
    try:
        (y, y_lens), (x, x_lens), (z, z_lens) = next(batches)
        _, summaries, loss_val = sess.run(
            [train_op, train_summary_op, st.loss],
            feed_dict={
                st.inputs: x,
                st.sequence_length: x_lens,
                st.previous_targets: y,
                st.previous_target_lengths: y_lens,
                st.next_targets: z,
                st.next_target_lengths: z_lens
            })
    except StopIteration:
        ...

Then I got an OOM. If I comment out the whole try body (i.e. don't feed any data), the script runs just fine.

I have no idea why I get an OOM at such a small data scale. Using nvidia-smi I always get:

Wed Sep  6 12:03:37 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.59                 Driver Version: 384.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   44C    P2    60W / 275W |  10623MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   43C    P2    62W / 275W |  10621MiB / 11171MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     32748    C   python3                                      10613MiB |
|    1     32748    C   python3                                      10611MiB |
+-----------------------------------------------------------------------------+

I can't see my script's actual GPU usage since TensorFlow always grabs all the memory at the beginning. And the actual problem here is that I don't know how to debug this.

I've read some posts about OOM on StackOverflow. Most of them happened when feeding a large test set to the model, and feeding the data in small batches can avoid the problem. But I don't see why such a small combination of data and parameters blows up on my 11GB 1080 Ti, since the error shows it only tries to allocate a matrix of size [3840, 155229] (the decoder's output matrix, where 3840 = 30 (time_steps) x 128 (batch_size) and 155229 is the vocab_size).

2017-09-06 12:14:45.787566: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ********************************************************************************************xxxxxxxx
2017-09-06 12:14:45.787597: W tensorflow/core/framework/op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[3840,155229]
2017-09-06 12:14:45.788735: W tensorflow/core/framework/op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[3840,155229]
     [[Node: decoder/previous_decoder/Add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](decoder/previous_decoder/MatMul, decoder/biases/read)]]
2017-09-06 12:14:45.790453: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 2857 get requests, put_count=2078 evicted_count=1000 eviction_rate=0.481232 and unsatisfied allocation rate=0.657683
2017-09-06 12:14:45.790482: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1139, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3840,155229]
     [[Node: decoder/previous_decoder/Add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](decoder/previous_decoder/MatMul, decoder/biases/read)]]
     [[Node: GradientDescent/update/_146 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2166_GradientDescent/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Any help will be appreciated. Thanks in advance.

Recommended answer

Let's go through the issues one by one:

Regarding TensorFlow allocating all the memory in advance, you can use the following code snippet to let TensorFlow allocate memory only when it is needed, so that you can see what is actually going on:

gpu_options = tf.GPUOptions(allow_growth=True)
session = tf.InteractiveSession(config=tf.ConfigProto(gpu_options=gpu_options))

This works equally well with tf.Session() instead of tf.InteractiveSession(), if you prefer.
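
A minimal sketch of the same option with a plain tf.Session (only the config objects matter here; everything else is illustrative):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run training steps here ...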

The second thing, about the sizes: since there is no information about your network size, we cannot estimate what is going wrong. However, you can debug the whole network step by step: for example, create a network with only one layer, get its output, create a session, feed values once, and check how much memory you consume. Iterate this debugging process until you find the point where you run out of memory.
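
One possible way to do this kind of step-by-step inspection (not part of the original answer, just a sketch using the TF 1.x tracing API and the names from the question) is to trace a single sess.run and dump a Chrome timeline that includes memory usage:

from tensorflow.python.client import timeline
import tensorflow as tf

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Run one training step with tracing enabled; feed_dict is built as in the training loop above.
sess.run([train_op, st.loss], feed_dict=feed_dict,
         options=run_options, run_metadata=run_metadata)

# Write a Chrome trace (open it at chrome://tracing) showing per-op time and memory.
trace = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format(show_memory=True))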

Please be aware that a 3840 x 155229 output is really, REALLY big. That's about 600 million values, roughly 2.22 GiB for this one layer alone. If you have any layers of a similar size, they will all add up and fill your GPU memory pretty fast.

Also, this is only the forward direction; if you are using this layer for training, the backpropagation and the layers added by the optimizer will multiply this size by 2. So for training you consume roughly 5 GB just for the output layer.
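
To make those numbers concrete, a rough calculation (added here, not part of the original answer):

# Memory for the decoder output tensor alone, float32 = 4 bytes.
rows, vocab = 3840, 155229        # 30 time steps x 128 batch, vocab size
forward = rows * vocab * 4
print(forward / 1024 ** 3)        # ~2.22 GiB for the forward logits
print(2 * forward / 1024 ** 3)    # ~4.4 GiB once gradients of the same shape are added
# The model has two such decoders (previous + next), so these buffers add up quickly.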

I suggest you revise your network and try to reduce the batch size / parameter count so that your model fits on the GPU.
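
For example, simply lowering the batch size passed to the generator shrinks the decoder logits proportionally (a sketch based on the iter_batches generator above):

# With batch_size=32 instead of 128, the logits go from [3840, 155229] (~2.2 GiB)
# to [960, 155229] (~0.56 GiB) per decoder.
batches = loader.iter_batches(batch_size=32, time_major=True)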
