How to make sure the training phase won't be facing an OOM?

Problem Description

The question in the title is complete. "How to make sure the training phase won't be facing an OOM?"

Just a few side notes: based on my experience, there are two cases of OOM. The first is when the memory needed for your model and the mini-batch is bigger than the memory you have. In that case, the training phase never starts, and the fix is to use a smaller batch size. It would have been great if I could calculate the biggest batch size that my hardware can manage for a particular model, but even if I cannot find it on the first try, I can always find it with some trial and error (since the process fails right away).
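One way to automate that trial and error is to probe a single training step at decreasing batch sizes and catch the out-of-memory error. Below is a minimal sketch of that idea, assuming a hypothetical build_model() factory that returns a compiled model and an unbatched tf.data dataset (both names are placeholders):

import tensorflow as tf

def find_max_batch_size(build_model, dataset, candidates=(256, 128, 64, 32, 16)):
    """Return the largest candidate batch size that survives one training step."""
    for batch_size in candidates:
        try:
            model = build_model()  # placeholder factory returning a compiled model
            model.fit(dataset.batch(batch_size),
                      steps_per_epoch=1, epochs=1, verbose=0)
            return batch_size      # one step fit into GPU memory
        except tf.errors.ResourceExhaustedError:
            # This batch size triggered an OOM; drop the graph and try a smaller one.
            tf.keras.backend.clear_session()
    raise RuntimeError("none of the candidate batch sizes fit into GPU memory")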

The second scenario where I face OOM is when the training process starts and runs for some time, maybe even a few epochs, and then for some unknown reason it hits OOM. For me this is the frustrating one, because it can happen at any time and you never know whether the training that is in progress will ever finish. So far I have lost days of training time while I thought everything was going along just fine.

I think some clarifications are in order. First of all, I'm talking about a personal computer with a GPU. Secondly, the GPU is dedicated to computation and is not used for display. Correct me if I'm wrong, but I believe this means that the training process demands different amounts of memory at different points in time. How could that be? And once again, how can I make sure that my training phase won't run into an OOM?
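As a side note on observing this, one way to make the allocator's behavior easier to follow, instead of letting TensorFlow reserve nearly all GPU memory up front, is to enable memory growth. A minimal sketch, assuming TF 2.x (it must run before any GPU op initializes the device):

import tensorflow as tf

# List the GPUs TensorFlow can see and let the allocator grow on demand
# instead of reserving (almost) all GPU memory at startup.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
print(gpus)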

Take this run as an example:

3150/4073 [======================>.......] - ETA: 53:39 - loss: 0.3323
2019-10-13 21:41:13.096320: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 60.81MiB (rounded to 63766528).  Current allocation summary follows.

After three hours of training, TensorFlow asked for more memory than my hardware could provide. My question is: why does this increase in memory allocation happen at this point rather than at the beginning of the process?
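One way to see how the allocation actually evolves over a run is to log the allocator's statistics from a Keras callback. A minimal sketch, assuming a recent TF 2.x release where tf.config.experimental.get_memory_info is available (the callback name is my own):

import tensorflow as tf

class GpuMemoryLogger(tf.keras.callbacks.Callback):
    """Print the GPU allocator's current and peak usage at the end of each epoch."""
    def on_epoch_end(self, epoch, logs=None):
        info = tf.config.experimental.get_memory_info('GPU:0')
        print(f"epoch {epoch}: current={info['current'] / 2**20:.0f} MiB, "
              f"peak={info['peak'] / 2**20:.0f} MiB")

# usage: pass it to fit, e.g. model.fit(..., callbacks=[GpuMemoryLogger()])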

[UPDATE]

In light of known issues with eager mode, I'll add some clarification about my case. I'm not coding in eager mode. Here is what my training code looks like:

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
training_dataset = tf.data.Dataset.from_tensor_slices(...)
validation_dataset = tf.data.Dataset.from_tensor_slices(...)

with strategy.scope():
    model = create_model()

    model.compile(optimizer='adam', loss='categorical_crossentropy')

    pocket = EarlyStopping(monitor='val_loss', min_delta=0.001,
                           patience=5, verbose=1,
                           restore_best_weights = True)

    history = model.fit(training_dataset.shuffle(buffer_size=1000).batch(30),
                        epochs=3,
                        callbacks=[pocket],
                        validation_data=validation_dataset.shuffle(buffer_size=1000).batch(30),
                        workers=3, use_multiprocessing=True)

Recommended Answer

There's a known memory leak [1] that happens if you train repeatedly in a loop. The solution is to call tf.keras.backend.clear_session(), and possibly gc.collect(), every now and then in the loop.

The behavior seems to be different in TF 1.15 and 2.0, though, and this might not be enough to fix it. I find that calling tf.keras.backend.clear_session() in my training loop on the CPU resets a gradual memory leak without hurting the training.
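As a rough illustration of that advice, assuming the training is driven by an outer Python loop (num_runs is a placeholder; create_model and training_dataset come from the question's code), the cleanup could look like this:

import gc
import tensorflow as tf

for run in range(num_runs):  # num_runs is a placeholder for however many runs you do
    model = create_model()
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    model.fit(training_dataset.shuffle(buffer_size=1000).batch(30), epochs=3)

    # Drop Keras/TF graph state and force Python to collect unreferenced objects
    # so memory does not accumulate across iterations of the loop.
    tf.keras.backend.clear_session()
    gc.collect()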

[1] https://github.com/tensorflow/tensorflow/issues/30324
