How to use TensorFlow tf.train.string_input_producer to produce several epochs of data?


Question

When I wanted to use tf.train.string_input_producer to load data for 2 epochs, I used

filename_queue = tf.train.string_input_producer(filenames=['data.csv'], num_epochs=2, shuffle=True)

col1_batch, col2_batch, col3_batch = tf.train.shuffle_batch(
    [col1, col2, col3], batch_size=batch_size, capacity=capacity,
    min_after_dequeue=min_after_dequeue, allow_smaller_final_batch=True)

But then I found that this op did not produce what I wanted.

It produces each sample in data.csv 2 times, but the order in which the samples are generated is not clear. For example, with 3 lines of data in data.csv

[[1]
[2]
[3]]

it will produce (each sample appears exactly 2 times, but the order is arbitrary)

[1]
[1]
[3]
[2]
[2]
[3]

but what I want is (each epoch kept separate, with shuffling inside each epoch)

(epoch 1:)
[1]
[2]
[3]
(epoch 2:)
[1]
[3]
[2]

In addition, how can I know when 1 epoch is done? Is there some flag variable? Thanks!

Here is my code.

import tensorflow as tf

def read_my_file_format(filename_queue):
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    record_defaults = [['1'], ['1'], ['1']]  
    col1, col2, col3 = tf.decode_csv(value, record_defaults=record_defaults, field_delim='-')
    # col1 = list(map(int, col1.split(',')))
    # col2 = list(map(int, col2.split(',')))
    return col1, col2, col3

def input_pipeline(filenames, batch_size, num_epochs=1):
  filename_queue = tf.train.string_input_producer(
    filenames, num_epochs=num_epochs, shuffle=True)
  col1,col2,col3 = read_my_file_format(filename_queue)

  min_after_dequeue = 10
  capacity = min_after_dequeue + 3 * batch_size
  col1_batch, col2_batch, col3_batch = tf.train.shuffle_batch(
    [col1, col2, col3], batch_size=batch_size, capacity=capacity,
    min_after_dequeue=min_after_dequeue, allow_smaller_final_batch=True)
  return col1_batch, col2_batch, col3_batch

filenames=['1.txt']
batch_size = 3
num_epochs = 1
a1,a2,a3=input_pipeline(filenames, batch_size, num_epochs)

with tf.Session() as sess:
  sess.run(tf.local_variables_initializer())
  # start populating filename queue
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)
  try:
    while not coord.should_stop():
      a, b, c = sess.run([a1, a2, a3])
      print(a, b, c)
  except tf.errors.OutOfRangeError:
    print('Done training, epoch reached')
  finally:
    coord.request_stop()

  coord.join(threads) 

My data looks like this:

1,2-3,4-A
7,8-9,10-B
12,13-14,15-C
17,18-19,20-D
22,23-24,25-E
27,28-29,30-F
32,33-34,35-G
37,38-39,40-H
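
For reference, the commented-out lines in read_my_file_format appear to aim at turning the first two columns into lists of integers. A minimal sketch of that conversion in plain Python, outside the TensorFlow graph, could look like the following. The helper name parse_record is hypothetical and not part of the original code, and the sketch assumes every line has exactly three '-'-separated fields as in the sample above.

```python
def parse_record(line):
    """Split one line such as '1,2-3,4-A' into two integer lists and a label.

    Hypothetical helper for illustration; it assumes exactly three
    '-'-separated fields, with the first two holding comma-separated integers.
    """
    col1, col2, col3 = line.strip().split('-')
    return (
        [int(x) for x in col1.split(',')],
        [int(x) for x in col2.split(',')],
        col3,
    )


print(parse_record('1,2-3,4-A'))
```

Values fetched with sess.run come back as byte strings, so they would likely need a .decode() call before being passed to a helper like this.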

Recommended answer

As Nicolas observes, the tf.train.string_input_producer() API does not give you the ability to detect when the end of an epoch is reached; instead, it concatenates all epochs into one long sequence of records, and the shuffling queue downstream then mixes records across the epoch boundary. For this reason, we recently added (in TensorFlow 1.2) the tf.contrib.data API, which makes it possible to express more sophisticated pipelines, including your use case.
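
To see why the epochs blur together, here is a toy model in plain Python. It is only a sketch under simplifying assumptions and not TensorFlow's actual queue implementation: the function shuffle_queue is a hypothetical stand-in for a shuffling queue that emits a random buffered element once more than min_after_dequeue items are waiting.

```python
import random
from collections import Counter


def shuffle_queue(stream, min_after_dequeue, rng):
    """Toy shuffling queue (illustrative only, not TensorFlow code).

    Buffers incoming items and, once more than `min_after_dequeue` are
    buffered, emits one of them at random. Remaining items are drained
    in random order when the input ends.
    """
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) > min_after_dequeue:
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


samples = [1, 2, 3]
num_epochs = 2

# The producer hands over all epochs as one stream with no boundary marker.
stream = [s for _ in range(num_epochs) for s in samples]

out = list(shuffle_queue(stream, min_after_dequeue=3, rng=random.Random(0)))
print(out)           # some interleaving of the two epochs
print(Counter(out))  # every sample still appears exactly num_epochs times
```

Because the buffer already holds records from the second epoch before the first epoch has been fully emitted, the first three outputs are not guaranteed to be a permutation of the three samples, which matches the behaviour described in the question.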

The following code snippet shows how you would write your program using tf.contrib.data:

import tensorflow as tf

def input_pipeline(filenames, batch_size):
    # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data.
    dataset = (tf.contrib.data.TextLineDataset(filenames)
               .map(lambda line: tf.decode_csv(
                    line, record_defaults=[['1'], ['1'], ['1']], field_delim='-'))
               .shuffle(buffer_size=10)  # Equivalent to min_after_dequeue=10.
               .batch(batch_size))

    # Return an *initializable* iterator over the dataset, which will allow us to
    # re-initialize it at the beginning of each epoch.
    return dataset.make_initializable_iterator() 

filenames=['1.txt']
batch_size = 3
num_epochs = 10
iterator = input_pipeline(filenames, batch_size)

# `a1`, `a2`, and `a3` represent the next element to be retrieved from the iterator.    
a1, a2, a3 = iterator.get_next()

with tf.Session() as sess:
    for _ in range(num_epochs):
        # Resets the iterator at the beginning of an epoch.
        sess.run(iterator.initializer)

        try:
            while True:
                a, b, c = sess.run([a1, a2, a3])
                print(a, b, c)
        except tf.errors.OutOfRangeError:
            # This will be raised when you reach the end of an epoch (i.e. the
            # iterator has no more elements).
            pass                 

        # Perform any end-of-epoch computation here.
        print('Done training, epoch reached')
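
The control flow above can also be sketched without TensorFlow, which may help when reasoning about what each epoch contains. In this plain-Python analogue, the hypothetical generator one_epoch plays the role of the re-initialized iterator, and the generator running out corresponds to tf.errors.OutOfRangeError. It assumes the shuffle buffer is at least as large as the dataset (as with buffer_size=10 and the 8-line file in the question), so each epoch is a full shuffle.

```python
import random


def one_epoch(samples, batch_size, rng):
    """Yield shuffled batches covering `samples` exactly once.

    Illustrative stand-in for shuffle(...).batch(...) on a dataset that
    fits entirely in the shuffle buffer; not TensorFlow code.
    """
    order = list(samples)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


samples = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
batch_size = 3
num_epochs = 2
rng = random.Random(0)

epochs = []
for epoch in range(num_epochs):
    # Creating a fresh generator mirrors sess.run(iterator.initializer).
    batches = list(one_epoch(samples, batch_size, rng))
    epochs.append(batches)
    # The generator is exhausted here, so this is the end-of-epoch point.
    print('epoch', epoch, batches)
```

Each epoch sees every sample exactly once, and the final batch of an epoch is smaller when the dataset size is not a multiple of batch_size.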
