Buffer underrun and ResourceExhausted errors with tensorflow


Question

I'm in high school and I'm trying to do a project involving neural networks. I am using Ubuntu and trying to do reinforcement learning with tensorflow, but I consistently get lots of underrun warnings when I train a neural network. They take the form of ALSA lib pcm.c:7963:(snd_pcm_recover) underrun occurred. This message is printed to the screen more and more frequently as training progresses. Eventually, I get a ResourceExhaustedError and the program terminates. Here is the full error message:

W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[320000,512]
Traceback (most recent call last):
  File "./train.py", line 121, in <module>
    loss, _ = model.train(minibatch, gamma, sess) # Train the model based on the batch, the discount factor, and the tensorflow session.
  File "/home/perrin/neural/dqn.py", line 174, in train
    return sess.run([self.loss, self.optimize], feed_dict=self.feed_dict) # Runs the training.  This is where the underrun errors happen
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[320000,512]
     [[Node: gradients/fully_connected/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](dropout/mul, gradients/fully_connected/BiasAdd_grad/tuple/control_dependency)]]

Caused by op u'gradients/fully_connected/MatMul_grad/MatMul_1', defined at:
  File "./train.py", line 72, in <module>
    model = AC_Net([None, 201, 201, 3], 5, trainer) # This creates the neural network using the imported AC_Net class.
  File "/home/perrin/neural/dqn.py", line 128, in __init__
    self.optimize = trainer.minimize(self.loss) # This tells the trainer to adjust the weights in such a way as to minimize the loss.  This is what actually
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 269, in minimize
    grad_loss=grad_loss)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 335, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 482, in gradients
    in_grads = grad_fn(op, *out_grads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_grad.py", line 731, in _MatMulGrad
    math_ops.matmul(op.inputs[0], grad, transpose_a=True))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1729, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1442, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

...which was originally created as op u'fully_connected/MatMul', defined at:
  File "./train.py", line 72, in <module>
    model = AC_Net([None, 201, 201, 3], 5, trainer) # This creates the neural network using the imported AC_Net class.
  File "/home/perrin/neural/dqn.py", line 63, in __init__
    net = slim.fully_connected(net, 512, activation_fn=tf.nn.elu, scope='fully_connected') # Feeds the input through a fully connected layer
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1350, in fully_connected
    outputs = standard_ops.matmul(inputs, weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1729, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1442, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[320000,512]
     [[Node: gradients/fully_connected/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](dropout/mul, gradients/fully_connected/BiasAdd_grad/tuple/control_dependency)]]

I researched these problems but didn't get a clear idea of how I could fix them. I am pretty new to programming so I don't know much about how buffers and data reading/writing works. I am very perplexed by these errors. Does anyone know what parts of my code might be causing this and how to fix it? Thanks for taking the time to consider this question!

Here is my code for defining the neural network (based on this tutorial):

#! /usr/bin/python

import numpy as np
import tensorflow as tf
slim = tf.contrib.slim

# The neural network
class AC_Net:
    # This defines the actual neural network.
    # output_size:  the number of outputs of the policy
    # trainer:  the tensorflow training optimizer used by the network
    def __init__(self, input_shape, output_size, trainer):

        with tf.name_scope('input'):
            self.input = tf.placeholder(shape=list(input_shape), dtype=tf.float32, name='input')
            net = tf.image.per_image_standardization(self.input[0])
            net = tf.expand_dims(net, [0])

        with tf.name_scope('convolution'):
            net = slim.conv2d(net, 32, [8, 8], activation_fn=tf.nn.elu, scope='conv')
            net = slim.max_pool2d(net, [2, 2], scope='pool')

        net = slim.flatten(net)
        net = tf.nn.dropout(net, .5)
        net = slim.fully_connected(net, 512, activation_fn=tf.nn.elu, scope='fully_connected')
        net = tf.nn.dropout(net, .5)

        with tf.name_scope('LSTM'):
            cell = tf.nn.rnn_cell.BasicLSTMCell(256, state_is_tuple=True, activation=tf.nn.elu)

            with tf.name_scope('state_in'):
                state_in = cell.zero_state(tf.shape(net)[0], tf.float32)

            net = tf.expand_dims(net, [0])  
            step_size = tf.shape(self.input)[:1]
            output, state = tf.nn.dynamic_rnn(cell, net, initial_state=state_in, sequence_length=step_size, time_major=False, scope='LSTM')

        out = tf.reshape(output, [-1, 256])
        out = tf.nn.dropout(out, .5)
        self.policy = slim.fully_connected(out, output_size, activation_fn=tf.nn.softmax, scope='policy')

        self.value = slim.fully_connected(out, 1, activation_fn=None, scope='value')

        # Defines the loss functions
        with tf.name_scope('loss_function'):
            self.target_values = tf.placeholder(dtype=tf.float32, name='target_values') # The target value is the discounted reward.
            self.actions = tf.placeholder(dtype=tf.int32, name='actions') # This is the network's policy.
            # The advantage is the difference between what the network thought the value of an action was, and what it actually was.
            # It is computed as R - V(s), where R is the discounted reward and V(s) is the value of being in state s.   
            self.advantages = tf.placeholder(dtype=tf.float32, name='advantages') 

            with tf.name_scope('entropy'):
                entropy = -tf.reduce_sum(tf.log(self.policy + 1e-10) * self.policy)
            with tf.name_scope('responsible_actions'):
                actions_onehot = tf.one_hot(self.actions, output_size, dtype=tf.float32)    
                responsible_actions = tf.reduce_sum(self.policy * actions_onehot, [1]) # This returns only the actions that were selected. 

            with tf.name_scope('loss'):

                with tf.name_scope('value_loss'):
                    self.value_loss = tf.reduce_sum(tf.square(self.target_values - tf.reshape(self.value, [-1])))

                with tf.name_scope('policy_loss'):
                    self.policy_loss = -tf.reduce_sum(tf.log(responsible_actions + 1e-10) * self.advantages)

                with tf.name_scope('total_loss'):
                    self.loss = self.value_loss + self.policy_loss - entropy * .01

                tf.summary.scalar('loss', self.loss)

        with tf.name_scope('gradient_clipping'):
            tvars = tf.trainable_variables()
            grads = tf.gradients(self.loss, tvars)          
            grads, _ = tf.clip_by_global_norm(grads, 20.)
        self.optimize = trainer.apply_gradients(zip(grads, tvars))

    def predict(self, inputs, sess):
        return sess.run([self.policy, self.value], feed_dict={self.input:inputs})

    def train(self, train_batch, gamma, sess):

        inputs = train_batch[:, 0]
        actions = train_batch[:, 1]
        rewards = train_batch[:, 2]
        values = train_batch[:, 4]

        discounted_rewards = rewards[::-1]
        for i, j in enumerate(discounted_rewards):
            if i > 0:
                discounted_rewards[i] += discounted_rewards[i - 1] * gamma
        discounted_rewards = np.array(discounted_rewards, np.float32)[::-1] 
        advantages = discounted_rewards - values 
        self.feed_dict = {
                self.input:np.vstack(inputs), 
                self.target_values:discounted_rewards, 
                self.actions:actions,
                self.advantages:advantages
                }
        return sess.run([self.loss, self.optimize], feed_dict=self.feed_dict)

Here is my code for training the neural network:

#! /usr/bin/python

import game_env, move_right, move_right_with_obs, random, inspect, os
import tensorflow as tf
import numpy as np
from dqn import AC_Net

def process_outputs(x):
    a = [int(x > 2), int(x%2 == 0 and x > 0)*2-int(x > 0)]  
    return a

environment = game_env # The environment to use
env_name = str(inspect.getmodule(environment).__name__) # The name of the environment

ep_length = 2000
num_episodes = 20

total_steps = ep_length * num_episodes # The total number of steps
model_path = '/home/perrin/neural/nn/' + env_name

learning_rate = 1e-4 # The learning rate
trainer = tf.train.AdamOptimizer(learning_rate=learning_rate) # The gradient descent optimizer used
first_epsilon = 0.6 # The initial chance of random action
final_epsilon = 0.01 # The final chance of random action
gamma = 0.9
anneal_steps = 35000 # The number of steps it takes to go from the initial to the final epsilon

count = 0 # Keeps track of the number of steps we've run
experience_buffer = [] # Stores the agent's experiences in a list
buffer_size = 10000 # How large the experience buffer can be
train_step = 256 # How often to train the model
batches_per_train = 10
save_step = 500 # How often to save the trained model
batch_size = 256 # How many experiences to train on at once
env_size = 500 # How many pixels tall and wide the environment should be.
load_model = True # Whether or not to load a pretrained model
train = True # Whether or not to train the model
test = False # Whether or not to test the model

tf.reset_default_graph()

sess = tf.InteractiveSession()

model = AC_Net([None, 201, 201, 3], 5, trainer)
env = environment.Env(env_size)
action = [0, 0]
state, _ = env.step(True, action)

saver = tf.train.Saver() # This saves the model
epsilon = first_epsilon
tf.global_variables_initializer().run()

if load_model:
    ckpt = tf.train.get_checkpoint_state(model_path)
    saver.restore(sess, ckpt.model_checkpoint_path) 
    print 'Model loaded'

prev_out = None

while count <= total_steps and train:

    if random.random() < epsilon or count == 0:
        if prev_out is not None:
            out = prev_out
        if random.randint(0, 100) == 100 or prev_out is None:
            out = np.random.rand(5)
            out = np.array([val/np.sum(out) for val in out])
            _, value = model.predict(state, sess)
            prev_out = out

    else:
        out, value = model.predict(state, sess)
        out = out[0]
    act = np.random.choice(out, p=out)
    act = np.argmax(out == act)
    act1 = process_outputs(act)
    action[act1[0]] = act1[1]
    _, reward = env.step(True, action)
    new_state = env.get_state()

    experience_buffer.append((state, act, reward, new_state, value[0, 0]))

    state = new_state

    if len(experience_buffer) > buffer_size:
        experience_buffer.pop(0)

    if count % train_step == 0 and count > 0:
        print "Training model"
        for i in range(batches_per_train):
        # Get a random sample of experiences and train the model based on it.
            x = random.randint(0, len(experience_buffer)-batch_size)
            minibatch = np.array(experience_buffer[x:x+batch_size])
            loss, _ = model.train(minibatch, gamma, sess)
            print "Loss for batch", str(i+1) + ":", loss


    if count % save_step == 0 and count > 0:
        saver.save(sess, model_path+'/model-'+str(count)+'.ckpt')
        print "Model saved"

    if count % ep_length == 0 and count > 0:
        print "Starting new episode"
        env = environment.Env(env_size)

    if epsilon > final_epsilon:
        epsilon -= (first_epsilon - final_epsilon)/anneal_steps

    count += 1

while count <= total_steps and test:
    out, _ = model.predict(state, sess)
    out = out[0]
    act = np.random.choice(out, p=out)
    act = np.argmax(out == act)
    act1 = process_outputs(act)
    action[act1[0]] = act1[1]
    state, reward = env.step(True, action)
    new_state = env.get_state()
    count += 1

# Write log files to create tensorboard visualizations
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('/home/perrin/neural/summaries', sess.graph)
if train:
    summary = sess.run(merged, feed_dict=model.feed_dict)
    writer.add_summary(summary)
writer.flush()

Recommended answer

You are running out of memory. Your network may simply need more memory than your machine has, so the first step in tracking down excessive memory usage is to figure out what is using it.

Here's one approach that uses timeline and statssummarizer: https://gist.github.com/yaroslavvb/08afccbe087171881ceafc0c98abca05

This will print out several tables, one of which lists the tensors sorted by memory usage, largest first. You should check that you don't have something unusually large in there.
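
As a rough cross-check on what those tables should show, the shape in the error message can be turned into bytes by hand. The sketch below is only an estimate under stated assumptions: it assumes slim.conv2d uses stride 1 with 'SAME' padding and slim.max_pool2d uses stride 2 with 'VALID' padding (neither is overridden in the question's code), and it counts four float32 copies of the fully connected weight matrix: the weights, their gradient, and the two moment buffers that Adam keeps per variable. The helper names are made up for illustration.

```python
BYTES_PER_FLOAT32 = 4  # DT_FLOAT in the error message


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Number of bytes needed to store a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def conv_same_out(size, stride=1):
    """Output width of a 'SAME'-padded convolution: ceil(size / stride)."""
    return -(-size // stride)


def pool_valid_out(size, kernel=2, stride=2):
    """Output width of a 'VALID' pooling layer."""
    return (size - kernel) // stride + 1


# 201x201 input -> conv (32 filters, assumed SAME, stride 1) -> 2x2 max pool.
side = pool_valid_out(conv_same_out(201))
flat = side * side * 32  # width of the flattened layer feeding fully_connected

weights = tensor_bytes([flat, 512])  # the [320000, 512] tensor from the error
total = 4 * weights  # weights + gradient + Adam's two moment buffers (estimate)

print("flattened features: %d" % flat)
print("one [%d, 512] float32 tensor: %.0f MiB" % (flat, weights / 2.0 ** 20))
print("four copies of it: %.0f MiB" % (total / 2.0 ** 20))
```

If these assumptions hold, the flattened width comes out to the 320000 seen in the error, so this one layer alone accounts for a large share of the allocation. Shrinking the input to the fully connected layer (more pooling, larger strides, or fewer filters) would reduce it proportionally.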

You can also view the memory timeline in the Chrome visualizer, as detailed here

A more advanced technique is to plot a timeline of memory allocations/deallocations, as done in this issue

Theoretically your memory usage shouldn't grow between steps if you aren't creating new stateful ops (Variables), but I found that global memory allocation can grow if sizes of your tensors change between steps.
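
One simple way to see whether memory really grows between steps is to log the process's peak resident set size after each step. This is a minimal sketch using only the standard library, not part of the original answer: run_with_memory_log and fake_step are hypothetical names, and fake_step stands in for a real call such as model.train. On Linux, ru_maxrss is reported in kilobytes (on macOS it is in bytes).

```python
import resource


def peak_rss():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def run_with_memory_log(step_fn, num_steps, log_every=1):
    """Call step_fn(step) repeatedly and record (step, peak RSS) pairs."""
    history = []
    for step in range(num_steps):
        step_fn(step)
        if step % log_every == 0:
            history.append((step, peak_rss()))
    return history


# Stand-in for a training step: it keeps a reference to 1 MiB of new data
# on every call, so the process holds on to more memory over time.
leaked = []


def fake_step(step):
    leaked.append(bytearray(b"x" * (1024 * 1024)))


history = run_with_memory_log(fake_step, num_steps=5)
for step, rss in history:
    print("step %d: peak RSS %d" % (step, rss))
```

Because this measures the whole process, it also captures memory held on the Python side, such as the experience buffer in the question, and not only TensorFlow's own allocations.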

A work-around is to periodically save your parameters to a checkpoint and restart your script.
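
A minimal sketch of that work-around, assuming the script is launched directly with the Python interpreter: after saving, the process replaces itself with a fresh copy via os.execv, which releases all memory held by the old process. The helper names should_restart and restart_script and the restart_every interval are hypothetical; the saver.save call in the comment mirrors the one already in the question's training loop.

```python
import os
import sys


def should_restart(step, restart_every):
    """True on every restart_every-th step, but never on step 0."""
    return step > 0 and step % restart_every == 0


def restart_script():
    """Replace the current process with a fresh run of the same script."""
    os.execv(sys.executable, [sys.executable] + sys.argv)


# Hypothetical use inside the training loop from the question:
#
#     if should_restart(count, restart_every=2000):
#         saver.save(sess, model_path + '/model-' + str(count) + '.ckpt')
#         restart_script()
```

Since the question's script already restores the latest checkpoint when load_model is True, the weights would survive the restart. Python-side state such as count, epsilon, and the experience buffer would not, so it would have to be saved and reloaded separately if it matters.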
