How to make sure the training phase won't be facing an OOM?

Problem Description

The question in the title is complete. "How to make sure the training phase won't be facing an OOM?"

Just a few side notes: based on my experience, there are two cases of OOM. The first is when the memory needed for your model and the mini-batch is bigger than the memory you have. In that case, the training phase never starts, and the fix is to use a smaller batch size. It would have been great if I could calculate the biggest batch size that my hardware can manage for a particular model, but even if I cannot find it on the first try, I can always find it with some trial and error (since the process fails right away).
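One way to automate that trial and error is to probe a single training step at decreasing batch sizes and catch the out-of-memory error. Below is a minimal sketch of that idea, assuming a hypothetical build_model() factory that returns a compiled model and an unbatched tf.data dataset (both names are placeholders):

import tensorflow as tf

def find_max_batch_size(build_model, dataset, candidates=(256, 128, 64, 32, 16)):
    """Return the largest candidate batch size that survives one training step."""
    for batch_size in candidates:
        try:
            model = build_model()  # placeholder factory returning a compiled model
            model.fit(dataset.batch(batch_size),
                      steps_per_epoch=1, epochs=1, verbose=0)
            return batch_size      # one step fit into GPU memory
        except tf.errors.ResourceExhaustedError:
            # This batch size triggered an OOM; drop the graph and try a smaller one.
            tf.keras.backend.clear_session()
    raise RuntimeError("none of the candidate batch sizes fit into GPU memory")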

The second scenario where I face OOM is when the training process starts and runs for some time, maybe even a few epochs, and then for some unknown reason it hits OOM. For me this is the frustrating one, because it can happen at any time and you never know whether the training that is in progress will ever finish. So far I have lost days of training time while I thought everything was going along just fine.

I think some clarifications are in order. First of all, I'm talking about a personal computer with a GPU. Secondly, the GPU is dedicated to computation and is not used for display. Correct me if I'm wrong, but I believe this means that the training process demands different amounts of memory at different points in time. How could that be? And once again, how can I make sure that my training phase won't run into an OOM?
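As a side note on observing this, one way to make the allocator's behavior easier to follow, instead of letting TensorFlow reserve nearly all GPU memory up front, is to enable memory growth. A minimal sketch, assuming TF 2.x (it must run before any GPU op initializes the device):

import tensorflow as tf

# List the GPUs TensorFlow can see and let the allocator grow on demand
# instead of reserving (almost) all GPU memory at startup.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
print(gpus)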

Take this run as an example:

3150/4073 [======================>.......] - ETA: 53:39 - loss: 0.3323
2019-10-13 21:41:13.096320: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 60.81MiB (rounded to 63766528).  Current allocation summary follows.

After three hours of training, TensorFlow asked for more memory than my hardware could provide. My question is: why does this increase in memory allocation happen at this point rather than at the beginning of the process?
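One way to see how the allocation actually evolves over a run is to log the allocator's statistics from a Keras callback. A minimal sketch, assuming a recent TF 2.x release where tf.config.experimental.get_memory_info is available (the callback name is my own):

import tensorflow as tf

class GpuMemoryLogger(tf.keras.callbacks.Callback):
    """Print the GPU allocator's current and peak usage at the end of each epoch."""
    def on_epoch_end(self, epoch, logs=None):
        info = tf.config.experimental.get_memory_info('GPU:0')
        print(f"epoch {epoch}: current={info['current'] / 2**20:.0f} MiB, "
              f"peak={info['peak'] / 2**20:.0f} MiB")

# usage: pass it to fit, e.g. model.fit(..., callbacks=[GpuMemoryLogger()])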

[UPDATE]

In light of known issues with eager mode, I'll add some clarification about my case. I'm not coding in eager mode. Here is what my training code looks like:

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
training_dataset = tf.data.Dataset.from_tensor_slices(...)
validation_dataset = tf.data.Dataset.from_tensor_slices(...)

with strategy.scope():
    model = create_model()

    model.compile(optimizer='adam', loss='categorical_crossentropy')

    pocket = EarlyStopping(monitor='val_loss', min_delta=0.001,
                           patience=5, verbose=1,
                           restore_best_weights = True)

    history = model.fit(training_dataset.shuffle(buffer_size=1000).batch(30),
                        epochs=3,
                        callbacks=[pocket],
                        validation_data=validation_dataset.shuffle(buffer_size=1000).batch(30),
                        workers=3, use_multiprocessing=True)

Recommended Answer

There's a known memory leak [1] that happens if you train repeatedly in a loop. The solution is to call tf.keras.backend.clear_session(), and possibly gc.collect(), every now and then in the loop.

The behavior seems to be different in TF 1.15 and 2.0, though, and this might not be enough to fix it. I find that calling tf.keras.backend.clear_session() in my training loop on the CPU resets a gradual memory leak without hurting the training.
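As a rough illustration of that advice, assuming the training is driven by an outer Python loop (num_runs is a placeholder; create_model and training_dataset come from the question's code), the cleanup could look like this:

import gc
import tensorflow as tf

for run in range(num_runs):  # num_runs is a placeholder for however many runs you do
    model = create_model()
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    model.fit(training_dataset.shuffle(buffer_size=1000).batch(30), epochs=3)

    # Drop Keras/TF graph state and force Python to collect unreferenced objects
    # so memory does not accumulate across iterations of the loop.
    tf.keras.backend.clear_session()
    gc.collect()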

[1] https://github.com/tensorflow/tensorflow/issues/30324
