Keras (TensorFlow, CPU): Training Sequential models in loop eats memory

Problem description

I am trying to train 1000 Sequential models in a loop. In every loop my program leaks memory until I run out and get an OOM exception.

I already asked a similar question before (Training multiple Sequential models in a row slows down)

and have seen others run into similar problems (Keras: Out of memory when doing hyper parameter grid search),

and the solution is always to add K.clear_session() to your code after you have finished using the model. So I did that in my previous question, but I am still leaking memory.

Here is code to reproduce the issue.

import random
import time
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras import backend as K
import tracemalloc


def run():
    tracemalloc.start()
    num_input_nodes = 12
    num_hidden_nodes = 8
    num_output_nodes = 1

    random_numbers = random.sample(range(1000), 50)
    train_x, train_y = create_training_dataset(random_numbers, num_input_nodes)

    for i in range(100):
        snapshot = tracemalloc.take_snapshot()
        for j in range(10):
            start_time = time.time()
            nn = Sequential()
            nn.add(Dense(num_hidden_nodes, input_dim=num_input_nodes, activation='relu'))
            nn.add(Dense(num_output_nodes))
            nn.compile(loss='mean_squared_error', optimizer='adam')
            nn.fit(train_x, train_y, nb_epoch=300, batch_size=2, verbose=0)
            K.clear_session()
            print("Iteration {iter}. Current time {t}. Took {elapsed} seconds".
                  format(iter=i*10 + j + 1, t=time.strftime('%H:%M:%S'), elapsed=int(time.time() - start_time)))

        top_stats = tracemalloc.take_snapshot().compare_to(snapshot, 'lineno')

        print("[ Top 5 differences ]")
        for stat in top_stats[:5]:
            print(stat)


def create_training_dataset(dataset, input_nodes):
    """
    Outputs a training dataset (train_x, train_y) as numpy arrays.
    Each item in train_x has 'input_nodes' number of items while train_y items are of size 1
    :param dataset: list of ints
    :param input_nodes:
    :return: (numpy array, numpy array), train_x, train_y
    """
    data_x, data_y = [], []
    for i in range(len(dataset) - input_nodes - 1):
        a = dataset[i:(i + input_nodes)]
        data_x.append(a)
        data_y.append(dataset[i + input_nodes])
    return numpy.array(data_x), numpy.array(data_y)

run()

Here is the output I get from the first memory debug print:

/tensorflow/python/framework/ops.py:121: size=3485 KiB (+3485 KiB), count=42343 (+42343)
/tensorflow/python/framework/ops.py:1400: size=998 KiB (+998 KiB), count=8413 (+8413)
/tensorflow/python/framework/ops.py:116: size=888 KiB (+888 KiB), count=32468 (+32468)
/tensorflow/python/framework/ops.py:1185: size=795 KiB (+795 KiB), count=3179 (+3179)
/tensorflow/python/framework/ops.py:2354: size=599 KiB (+599 KiB), count=5886 (+5886)

System information:

  • python 3.5
  • keras (1.2.2)
  • tensorflow (1.0.0)

Answer

The memory leak stems from Keras and TensorFlow using a single "default graph" to store the network structure, which increases in size with each iteration of the inner for loop.

Calling K.clear_session() frees some of the (backend) state associated with the default graph between iterations, but an additional call to tf.reset_default_graph() is needed to clear the Python state.
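
For concreteness, here is a minimal sketch of how both calls could be placed in the inner loop of the reproduction code above (synthetic training data; assumes the TensorFlow 1.x API, where tf.reset_default_graph() is available):

import numpy
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras import backend as K

train_x = numpy.random.rand(50, 12)
train_y = numpy.random.rand(50)

for i in range(100):
    nn = Sequential()
    nn.add(Dense(8, input_dim=12, activation='relu'))
    nn.add(Dense(1))
    nn.compile(loss='mean_squared_error', optimizer='adam')
    nn.fit(train_x, train_y, nb_epoch=10, batch_size=2, verbose=0)
    # Free the backend (session) state tied to the default graph ...
    K.clear_session()
    # ... and also discard the Python-side default graph so it cannot keep growing.
    tf.reset_default_graph()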

Note that there might be a more efficient solution: since nn does not depend on either of the loop variables, you can define it outside the loop, and reuse the same instance inside the loop. If you do that, there is no need to clear the session or reset the default graph, and performance should increase because you benefit from caching between iterations.
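
A sketch of what that reuse could look like (synthetic data again; the get_weights/set_weights re-initialisation step is an addition not mentioned in the answer, only needed if each run should start from the same fresh weights rather than continue training):

import numpy
from keras.models import Sequential
from keras.layers import Dense

train_x = numpy.random.rand(50, 12)
train_y = numpy.random.rand(50)

# Build and compile the model once, outside the loop.
nn = Sequential()
nn.add(Dense(8, input_dim=12, activation='relu'))
nn.add(Dense(1))
nn.compile(loss='mean_squared_error', optimizer='adam')

# Remember the freshly initialised weights so every run starts from scratch.
initial_weights = nn.get_weights()

for i in range(100):
    nn.set_weights(initial_weights)
    nn.fit(train_x, train_y, nb_epoch=10, batch_size=2, verbose=0)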
