Adding multiple layers to TensorFlow causes loss function to become NaN

Question

I'm writing a neural-network classifier in TensorFlow/Python for the notMNIST dataset. I've implemented L2 regularization and dropout on the hidden layers. It works fine as long as there is only one hidden layer, but when I add more layers (to improve accuracy), the loss increases rapidly at each step and becomes NaN by step 5. I tried temporarily disabling dropout and L2 regularization, but I get the same behavior as long as there are 2+ layers. I even rewrote my code from scratch (with some refactoring to make it more flexible), with the same results. The number and size of the layers are controlled by hidden_layer_spec. What am I missing?

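import math
import numpy as np
import tensorflow as tf

# train_dataset, train_labels, valid_dataset, valid_labels, test_dataset,
# test_labels, image_size, num_labels and accuracy() are assumed to be
# defined earlier; they are not shown in the question.

# Works for np.array([1024]) with about 96.1% accuracy
hidden_layer_spec = np.array([1024, 300])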
num_hidden_layers = hidden_layer_spec.shape[0]
batch_size = 256
beta = 0.0005

epochs = 100
stepsPerEpoch = float(train_dataset.shape[0]) / batch_size
num_steps = int(math.ceil(float(epochs) * stepsPerEpoch))

l2Graph = tf.Graph()
with l2Graph.as_default():
  #with tf.device('/cpu:0'):
      # Input data. For the training data, we use a placeholder that will be fed
      # at run time with a training minibatch.
      tf_train_dataset = tf.placeholder(tf.float32,
                                        shape=(batch_size, image_size * image_size))
      tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
      tf_valid_dataset = tf.constant(valid_dataset)
      tf_test_dataset = tf.constant(test_dataset)

      weights = []
      biases = []
      for hi in range(0, num_hidden_layers + 1):
        width = image_size * image_size if hi == 0 else hidden_layer_spec[hi - 1]
        height = num_labels if hi == num_hidden_layers else hidden_layer_spec[hi]
        weights.append(tf.Variable(tf.truncated_normal([width, height]), name = "w" + str(hi + 1)))
        biases.append(tf.Variable(tf.zeros([height]), name = "b" + str(hi + 1)))
        print(str(width) + 'x' + str(height))

      def logits(input, addDropoutLayer = False):
        previous_layer = input
        for hi in range(0, hidden_layer_spec.shape[0]):
          previous_layer = tf.nn.relu(tf.matmul(previous_layer, weights[hi]) + biases[hi])
          if addDropoutLayer:
            previous_layer = tf.nn.dropout(previous_layer, 0.5)
        return tf.matmul(previous_layer, weights[num_hidden_layers]) + biases[num_hidden_layers]

      # Training computation.
      train_logits = logits(tf_train_dataset, True)

      # Sum the L2 penalty over every weight matrix, not just the first one.
      l2 = tf.nn.l2_loss(weights[0])
      for hi in range(1, len(weights)):
        l2 = l2 + tf.nn.l2_loss(weights[hi])
      loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(train_logits, tf_train_labels)) + beta * l2

      # Optimizer.
      global_step = tf.Variable(0)  # count the number of steps taken.
      learning_rate = tf.train.exponential_decay(0.5, global_step, int(stepsPerEpoch) * 2, 0.96, staircase = True)
      optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)

      # Predictions for the training, validation, and test data.
      train_prediction = tf.nn.softmax(train_logits)
      valid_prediction = tf.nn.softmax(logits(tf_valid_dataset))
      test_prediction = tf.nn.softmax(logits(tf_test_dataset))
      saver = tf.train.Saver()

with tf.Session(graph=l2Graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Learning rate: %f" % learning_rate.eval())
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
  save_path = saver.save(session, "l2_degrade.ckpt")
  print("Model saved to " + str(save_path))

Recommended answer

It turns out this was less a coding issue than a deep learning issue. The extra layer made the gradients too unstable, which led to the loss quickly devolving to NaN. The best way to fix this is to use Xavier initialization. Otherwise, the variance of the initial weights tends to be too high, causing instability. Decreasing the learning rate may also help.
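
To make the variance argument concrete, here is a minimal sketch in plain Python (no TensorFlow needed) that compares the spread of the final logits under two weight scales: the stddev of 1.0 that tf.truncated_normal uses when none is given, and the Xavier (Glorot) value sqrt(2 / (fan_in + fan_out)). The helper names xavier_stddev, unit_stddev and logit_std are hypothetical, the input is random values in [0, 1) rather than real notMNIST data, the weights are drawn from an ordinary (not truncated) normal, and the 784 inputs and 10 classes are assumptions; only the 1024 and 300 hidden sizes come from the question.

```python
import math
import random


def xavier_stddev(fan_in, fan_out):
    """Standard deviation of Xavier (Glorot) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def unit_stddev(fan_in, fan_out):
    """Nominal stddev when no stddev is passed to tf.truncated_normal."""
    return 1.0


def logit_std(layer_sizes, stddev_fn, seed=0):
    """Push one random input through ReLU layers with zero biases and
    return the standard deviation of the final (linear) layer's outputs."""
    rng = random.Random(seed)
    # Hypothetical input: values in [0, 1), standing in for pixel data.
    activations = [rng.random() for _ in range(layer_sizes[0])]
    last = len(layer_sizes) - 2
    for i in range(len(layer_sizes) - 1):
        fan_in, fan_out = layer_sizes[i], layer_sizes[i + 1]
        sd = stddev_fn(fan_in, fan_out)
        # Each output unit is a dot product with a fresh random weight column.
        outputs = [sum(a * rng.gauss(0.0, sd) for a in activations)
                   for _ in range(fan_out)]
        # ReLU on hidden layers; the last layer stays linear (the logits).
        activations = outputs if i == last else [max(0.0, z) for z in outputs]
    mean = sum(activations) / len(activations)
    return math.sqrt(sum((z - mean) ** 2 for z in activations) / len(activations))


if __name__ == "__main__":
    # 784 inputs and 10 classes are assumptions (28x28 images, 10 labels);
    # 1024 and 300 come from hidden_layer_spec in the question.
    sizes = [784, 1024, 300, 10]
    print("logit std, stddev=1.0:", logit_std(sizes, unit_stddev))
    print("logit std, Xavier:    ", logit_std(sizes, xavier_stddev))
```

With unit-variance weights the logits should come out orders of magnitude larger than with Xavier scaling, and every additional hidden layer multiplies the gap again. Logits that large make the softmax cross-entropy and its gradients huge, which is consistent with the divergence described in the question. In the question's code, one way to apply this would be to pass stddev=xavier_stddev(width, height) to tf.truncated_normal when creating each weight matrix.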
