Google Colab Pro crashed while allocating large memory


Question


I'm trying to use the Colab Pro GPU (max 25 GB memory) for training a sequential model. Based on the instructions found here, I'm setting the memory limit to 22 GB. Below is my code and logs.

import tensorflow as tf

# Cap TensorFlow's GPU memory at 22 GB (the Colab Pro GPU offers up to 25 GB).
mem_limit = 22000  # in MB

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=mem_limit)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

Based on this log, it appears the memory cap is being set:

Dec 22, 2020, 7:57:15 PM    WARNING 2020-12-23 01:57:15.673093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22000 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)

Dec 22, 2020, 7:57:15 PM    WARNING 2020-12-23 01:57:15.673030: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.


However, when executing a statement, it invariably attempts to allocate 37 GB of memory and the runtime crashes. Here is the log:

Dec 22, 2020, 8:01:01 PM    INFO    KernelRestarter: restarting kernel (1/5), keep random ports

Dec 22, 2020, 8:00:47 PM    WARNING tcmalloc: large alloc 37200994304 bytes == 0x7f48b828a000 @ 0x7f5249f5a001 0x7f52414564ff 0x7f52414a6ab8 0x7f52414aabb7 0x7f5241549003 0x50a4a5 0x50cc96 0x507be4 0x509900 0x50a2fd 0x50cc96 0x507be4 0x5161c5 0x50a12f 0x50beb4 0x507be4 0x509900 0x50a2fd 0x50beb4 0x507be4 0x509900 0x50a2fd 0x50cc96 0x507be4 0x508ec2 0x594a01 0x59fd0e 0x50d256 0x507be4 0x509900 0x50a2fd


My dataset is large and may require more than 128 GB of memory. Is there a way to limit the amount of memory used by TF? I'm fine with a longer execution time if it comes to that.

Thanks in advance.

Answer


I have had the same issue and had to change my TF code. Setting a maximum GPU memory does not mean that TF will figure out a way to run your code without allocating more than you have specified. That works for what I would call "units" of allocation, but if one single operation is gigantic, it will blow up.


So, let's suppose you have a massive matrix multiplication that can't fit on the GPU. Colab will crash.
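
To make that concrete, here is a minimal sketch of the failure mode (my own illustration, not part of the original answer), assuming the 22 GB virtual-device limit from the question is already in place; the shapes are hypothetical and chosen only so that the single result tensor exceeds the limit:

import tensorflow as tf

# Hypothetical sizes: the inputs are ~2 GB each, but the result of this
# single matmul is 100000 x 100000 float32 values, roughly 40 GB, well
# past a 22 GB cap. The cap does not split the op for you; TF still asks
# the allocator for the whole ~40 GB block and the runtime dies.
a = tf.random.normal([100000, 5000])
b = tf.random.normal([5000, 100000])
c = tf.matmul(a, b)  # one gigantic allocation -> tcmalloc large alloc / crash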


Based on my limited experience, you have 2 options:

  1. Change your setup so it does not use the GPU (at the cost of performance)
  2. Change your code (see the sketch after this list)
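
As an illustration of option 2 (again my own sketch, not from the original answer), the usual fix is to break one gigantic operation into pieces that each fit under the limit. The chunked_matmul helper and the chunk size below are hypothetical; the right restructuring depends on your actual model:

import tensorflow as tf

def chunked_matmul(a, b, chunk_rows=5000):
  # Multiply a @ b in row blocks so the GPU only ever holds one
  # chunk_rows x b.shape[1] slice of the result at a time.
  pieces = []
  for start in range(0, a.shape[0], chunk_rows):
    block = tf.matmul(a[start:start + chunk_rows], b)
    # Park each finished block in host memory; note the full result
    # still has to fit in host RAM (or be written to disk instead).
    with tf.device('/CPU:0'):
      pieces.append(tf.identity(block))
  with tf.device('/CPU:0'):
    return tf.concat(pieces, axis=0)

For option 1, hiding the GPU entirely with tf.config.set_visible_devices([], 'GPU') before any op runs forces the computation onto the CPU and host RAM, trading speed for headroom.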

