Tensorflow Train on incomplete batch


Problem description

I'm trying to do training with batches in tensorflow. This works partially, since I can get through the first epoch in batches. I currently have 2 problems with my code.
1. After the first epoch has finished, the second epoch immediately goes to the except tf.errors.OutOfRangeError, and the next epoch doesn't restart the batching from the top. How can I run another epoch where it produces batches again?
2. I print the batchnr and I notice that for the last batch of the epoch print(batchnr) is printed but print("End", batchnr) is not; it goes to the except and does not get trained. I guess this is because the number of rows left in the queue is smaller than the batch size. How can I still train on that last partial batch?

My train and input pipeline methods:

def input_pipeline(file, batch_size, num_epochs=None):
  filename_queue = tf.train.string_input_producer([file], num_epochs=num_epochs, shuffle=True)
  example, label = read_from_csv(filename_queue)
  min_after_dequeue = 10000
  capacity = min_after_dequeue + 3 * 2
  example_batch, label_batch = tf.train.shuffle_batch(
      [example, label], batch_size=batch_size, capacity=capacity,
      min_after_dequeue=min_after_dequeue)
  return example_batch, label_batch

def train():
    examples, labels = input_pipeline(training_data_file, batch_size, 1)
    saver = tf.train.Saver()
    prediction = neural_network_model(p_inputdata)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=p_known_labels))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

    init = tf.group(tf.initialize_all_variables(),
                    tf.initialize_local_variables())
    with tf.Session() as sess:
        sess.run(init) # initialize all variables **in** the session

        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(p_known_labels, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

        latest_cost_of_batch = None
        for e in range(epochs):
            epoch = e + 1
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(coord=coord)
            try:
                batchnr = 1
                while not coord.should_stop():
                    print(batchnr)
                    batch_data, batch_labels = sess.run([examples, labels])
                    batch_labels_output = get_output_values(batch_labels)
                    print("End", batchnr)
                    batchnr += 1

                    _, latest_cost_of_batch = sess.run([optimizer,cost], feed_dict={
                        p_inputdata: batch_data,
                        p_known_labels: batch_labels_output
                    })

            except tf.errors.OutOfRangeError:
                print('Done training, epoch reached')
                if (epoch) % print_each_x_number_of_epochs == 0 or epoch == 0:
                    print('Epoch', epoch, 'completed out of', epochs, "---", 'Loss', latest_cost_of_batch)
                if epoch % save_each_x_number_of_epochs == 0:
                    saver.save(sess, checkpoint_label)
            finally:
                coord.request_stop()
        coord.join(threads)

        print("Trained for ", epochs,"epochs. Saving variables...")
        saver.save(sess, checkpoint_label)
        print("Variables saved. Training finished.")
    end = time.time()
    seconds = end - start
    print("Total runtime:", str(datetime.timedelta(seconds=seconds)))

Debug console output

Start training
1
End 1
2
End 2
....
213
End 213
214
Done training, epoch reached
Epoch 1 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 2 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 3 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 4 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 5 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 6 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 7 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 8 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 9 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 10 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 11 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 12 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 13 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 14 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 15 completed out of 15 --- Loss 4.43414
Trained for  15 epochs. Saving variables...
Variables saved. Training finished.
Accuracy 0.935310311615 % after 15 epochs of training
Total runtime: 0:00:21.395917

EDIT
I changed the code based on the answer by Nicolas (I went with multiple epochs in the string_input_producer). The training code now looks like this:

def train():
    """Trains the neural network  
    """
    examples, labels = input_pipeline(training_data_file, batch_size, epochs)
    start = time.time()
    saver = tf.train.Saver()
    prediction = neural_network_model(p_inputdata)
    first_no_loss = True
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=p_known_labels))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

    init = tf.group(tf.initialize_all_variables(),
                    tf.initialize_local_variables())
    with tf.Session() as sess:
        sess.run(init) # initialize all variables **in** the session
        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(p_known_labels, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

        print("Start training")
        latest_cost_of_batch = None

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        epoch_op = "input_producer/limit_epochs/epochs:0"
        try:
            batchnr = 1
            epochs_var = 0
            while not coord.should_stop():
                if (batchnr) % print_each_x_number_of_batches == 0:
                    print('Batch', batchnr, 'completed of epoch', epochs_var, "---", 'Loss', latest_cost_of_batch)

                if  batchnr > 3194:
                    print("GETTING BATCH", batchnr)
                epochs_var, batch_data, batch_labels = sess.run([epoch_op, examples, labels])
                batch_labels_output = get_output_values(batch_labels)
                if  batchnr > 3194:
                    print("GOT BATCH", batchnr)
                batchnr += 1
                _, latest_cost_of_batch = sess.run([optimizer,cost], feed_dict={
                    p_inputdata: batch_data,
                    p_known_labels: batch_labels_output
                })

        except tf.errors.OutOfRangeError:
            print('Done training, epoch reached')
        finally:
            coord.request_stop()

        coord.join(threads)

        print("Trained for ", epochs,"epochs. Saving variables...")
        saver.save(sess, checkpoint_label)
        print("Variables saved. Training finished.")
        labels, values, output = get_training_or_testdata(training_data_file)
        print('Accuracy', accuracy.eval(feed_dict={p_inputdata: values, p_known_labels: output}) * 100, '% after', epochs, 'epochs of training')
    end = time.time()
    seconds = end - start
    print("Total runtime:", str(datetime.timedelta(seconds=seconds)))

My output looks like this:

Start training
Batch 100 completed of epoch 15 --- Loss 4.79351
Batch 200 completed of epoch 15 --- Loss 4.57468
Batch 300 completed of epoch 15 --- Loss 4.51134
Batch 400 completed of epoch 15 --- Loss 4.65865
Batch 500 completed of epoch 15 --- Loss 4.55456
Batch 600 completed of epoch 15 --- Loss 4.63549
Batch 700 completed of epoch 15 --- Loss 4.53037
Batch 800 completed of epoch 15 --- Loss 4.49263
Batch 900 completed of epoch 15 --- Loss 4.37
Batch 1000 completed of epoch 15 --- Loss 4.42719
Batch 1100 completed of epoch 15 --- Loss 4.4518
Batch 1200 completed of epoch 15 --- Loss 4.41053
Batch 1300 completed of epoch 15 --- Loss 4.43508
Batch 1400 completed of epoch 15 --- Loss 4.32173
Batch 1500 completed of epoch 15 --- Loss 4.36624
Batch 1600 completed of epoch 15 --- Loss 4.44027
Batch 1700 completed of epoch 15 --- Loss 4.37201
Batch 1800 completed of epoch 15 --- Loss 4.24956
Batch 1900 completed of epoch 15 --- Loss 4.40256
Batch 2000 completed of epoch 15 --- Loss 4.18391
Batch 2100 completed of epoch 15 --- Loss 4.30156
Batch 2200 completed of epoch 15 --- Loss 4.38423
Batch 2300 completed of epoch 15 --- Loss 4.23823
Batch 2400 completed of epoch 15 --- Loss 4.17783
Batch 2500 completed of epoch 15 --- Loss 4.31024
Batch 2600 completed of epoch 15 --- Loss 4.26312
Batch 2700 completed of epoch 15 --- Loss 4.26143
Batch 2800 completed of epoch 15 --- Loss 4.16691
Batch 2900 completed of epoch 15 --- Loss 4.48624
Batch 3000 completed of epoch 15 --- Loss 4.1347
Batch 3100 completed of epoch 15 --- Loss 4.20801
GETTING BATCH 3195
GOT BATCH 3195
GETTING BATCH 3196
GOT BATCH 3196
GETTING BATCH 3197
Done training, epoch reached
Trained for  15 epochs. Saving variables...
Variables saved. Training finished.
Accuracy 2.69019026309 % after 15 epochs of training
Total runtime: 0:03:07.577149

The things that I noticed are that the last batch still doesn't get trained (GOT BATCH 3197 doesn't get printed) and, second, that the way I get the current epoch isn't correct: it is always 15. Another SO answer explained why the way I do it now is not the way to go, but it doesn't explain a proper way to get the current epoch. Any clues?

Recommended answer


You might want to have a look at this answer, as it gives an example of the new API.

Here is an explanation of what you got.

  • The first time you go through the for e in range(epochs) loop, it dequeues everything from your data queue (until the data queue throws tf.errors.OutOfRangeError).

This error is thrown when there are no more filenames in the filename queue, which happens after reading the file only once, because you called examples, labels = input_pipeline(training_data_file, batch_size, 1).

If, for example, you had called examples, labels = input_pipeline(training_data_file, batch_size, 3), you would have gone through the files 3 times before moving to e=1.

Then, when you move to e>0, the filename queue remembers that you already dequeued all the filenames, and since there is no further enqueue operation it throws tf.errors.OutOfRangeError immediately.

See the tf.train.string_input_producer docstring:

Note: if num_epochs is not None, this function creates local counter epochs. Use local_variables_initializer() to initialize local variables.
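
In practice this just means that the local variables have to be initialized after input_pipeline has been called, as the question's code already does with tf.group. A minimal sketch using the non-deprecated initializer names would be:

# Assumes input_pipeline(...) has already been called, so that the hidden
# local 'epochs' counter created by tf.train.string_input_producer exists.
init = tf.group(tf.global_variables_initializer(),
                tf.local_variables_initializer())
with tf.Session() as sess:
    sess.run(init)  # initializes the model weights and the local counter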

What can you do?

  1. You can re-initialize the local variables of the input_producer scope at the start of each iteration of the for e in range(epochs) loop, inside the session context manager:

init_queue = tf.variables_initializer(tf.get_collection(tf.GraphKeys.LOCAL_VARIABLES, scope='input_producer'))
with tf.Session() as sess:
    sess.run(init)            # initialize global and local variables once
    for e in range(EPOCHS):
        sess.run(init_queue)  # re-initialize all local variables **in** the input_producer scope
        epoch = e + 1
        ...

It would mean that you reinitialize all your local variables in the input_producer scope, so you would need to be careful about what they are. You could also save your model and load it again at each step, or

  2. You rely on the num_epochs argument to run the right number of epochs and remove your for e in range(EPOCHS) loop. Instead of printing information at the end of each epoch, you could print information every 100 or 1000 training steps (my favourite solution). If you really want to print information at the end of each epoch, you could try to access the hidden epochs variable, eval its value and print the information whenever it changes (I wouldn't recommend this option).

For example:

    batchnr = 0
    last_epoch = 0
    while not coord.should_stop():
            # epochs_tensor is a handle on the hidden counter, e.g. the
            # "input_producer/limit_epochs/epochs:0" tensor used in the edit above
            current_epoch, _, _ = sess.run([epochs_tensor, examples, labels])
            if current_epoch != last_epoch:
                print(....)   # the counter changed: print the per-epoch information here
                last_epoch = current_epoch
            print("End", batchnr)
            batchnr += 1

Hope it helps!

REMARKS ON THE EDITED QUESTION:

Looking at the emphasised part of the quote from the answer you referred to, it looks to me like you have no way of knowing which epoch a dequeued batch belongs to.

When tf.train.start_queue_runners() is executed, all the epochs are enqueued together (in multiple stages if capacity is less than the number of filenames). The local variable epochs:0 is used by tf.train.string_input_producer to keep track of the epoch that is being enqueued. Once epochs:0 reaches num_epochs it remains constant, and no matter how many threads are dequeuing from the queue, it does not change.
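
As a side note on the second original question (training the last, smaller batch): the answer above does not address it directly, but tf.train.shuffle_batch accepts an allow_smaller_final_batch argument, and the newer tf.data API (available from TF 1.4) handles both the epoch count and the partial final batch without manual queue management. A rough sketch under those assumptions, where parse_row is a hypothetical helper for whatever CSV parsing the model needs:

# 1) Queue-based pipeline: emit the last, smaller batch instead of dropping it.
example_batch, label_batch = tf.train.shuffle_batch(
    [example, label], batch_size=batch_size, capacity=capacity,
    min_after_dequeue=min_after_dequeue,
    allow_smaller_final_batch=True)

# 2) tf.data pipeline: repeat() runs the epochs, batch() automatically
#    produces a smaller final batch. parse_row is a placeholder here.
dataset = (tf.data.TextLineDataset(training_data_file)
           .map(parse_row)
           .shuffle(buffer_size=10000)
           .repeat(epochs)
           .batch(batch_size))
examples, labels = dataset.make_one_shot_iterator().get_next()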
