How to use TensorFlow tf.train.string_input_producer to produce several epochs of data?


Problem description

When I want to use tf.train.string_input_producer to load data for 2 epochs, I used

filename_queue = tf.train.string_input_producer(['data.csv'], num_epochs=2, shuffle=True)

col1_batch, col2_batch, col3_batch = tf.train.shuffle_batch(
    [col1, col2, col3], batch_size=batch_size, capacity=capacity,
    min_after_dequeue=min_after_dequeue, allow_smaller_final_batch=True)

But then I found that this op did not produce what I want.

It does produce each sample in data.csv exactly 2 times, but the generated order is not clear. For example, with 3 lines of data in data.csv:

[[1]
[2]
[3]]

it will produce output like the following (each sample appears exactly 2 times, but the order is arbitrary):

[1]
[1]
[3]
[2]
[2]
[3]

but what I want is (each epoch is separate, shuffled within each epoch):

(epoch 1:)
[1]
[2]
[3]
(epoch 2:)
[1]
[3]
[2]

In addition, how do I know when 1 epoch is done? Is there some flag variable? Thanks!

My code is here:

import tensorflow as tf

def read_my_file_format(filename_queue):
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    record_defaults = [['1'], ['1'], ['1']]  # three string columns per record
    col1, col2, col3 = tf.decode_csv(value, record_defaults=record_defaults, field_delim='-')
    # col1 = list(map(int, col1.split(',')))
    # col2 = list(map(int, col2.split(',')))
    return col1, col2, col3

def input_pipeline(filenames, batch_size, num_epochs=1):
  filename_queue = tf.train.string_input_producer(
    filenames, num_epochs=num_epochs, shuffle=True)
  col1, col2, col3 = read_my_file_format(filename_queue)

  min_after_dequeue = 10
  capacity = min_after_dequeue + 3 * batch_size
  col1_batch, col2_batch, col3_batch = tf.train.shuffle_batch(
    [col1, col2, col3], batch_size=batch_size, capacity=capacity,
    min_after_dequeue=min_after_dequeue, allow_smaller_final_batch=True)
  return col1_batch, col2_batch, col3_batch

filenames = ['1.txt']
batch_size = 3
num_epochs = 1
a1, a2, a3 = input_pipeline(filenames, batch_size, num_epochs)

with tf.Session() as sess:
  sess.run(tf.local_variables_initializer())  # num_epochs creates a local epoch counter, so local init is required
  # start populating filename queue
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)
  try:
    while not coord.should_stop():
      a, b, c = sess.run([a1, a2, a3])
      print(a, b, c)
  except tf.errors.OutOfRangeError:
    print('Done training, epoch reached')
  finally:
    coord.request_stop()

  coord.join(threads) 

My data looks like:

1,2-3,4-A
7,8-9,10-B
12,13-14,15-C
17,18-19,20-D
22,23-24,25-E
27,28-29,30-F
32,33-34,35-G
37,38-39,40-H
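
For reference, here is a minimal sketch (my addition, assuming TensorFlow 1.x) of what the decode step yields for this data: because field_delim='-', each line is split on '-' into three string columns, and the commas inside each field are left intact.

import tensorflow as tf

# Hypothetical check: parse one sample line the same way read_my_file_format does.
line = tf.constant('1,2-3,4-A')
col1, col2, col3 = tf.decode_csv(
    line, record_defaults=[['1'], ['1'], ['1']], field_delim='-')

with tf.Session() as sess:
    print(sess.run([col1, col2, col3]))  # [b'1,2', b'3,4', b'A']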

Recommended answer

As Nicolas observes, the tf.train.string_input_producer() API does not give you the ability to detect when the end of an epoch is reached; instead, it concatenates all epochs together into one long batch. For this reason, we recently added (in TensorFlow 1.2) the tf.contrib.data API, which makes it possible to express more sophisticated pipelines, including your use case.

The following code snippet shows how you would write your program using tf.contrib.data:

import tensorflow as tf

def input_pipeline(filenames, batch_size):
    # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data.
    dataset = (tf.contrib.data.TextLineDataset(filenames)
               .map(lambda line: tf.decode_csv(
                    line, record_defaults=[['1'], ['1'], ['1']], field_delim='-'))
               .shuffle(buffer_size=10)  # Equivalent to min_after_dequeue=10.
               .batch(batch_size))

    # Return an *initializable* iterator over the dataset, which will allow us to
    # re-initialize it at the beginning of each epoch.
    return dataset.make_initializable_iterator() 

filenames = ['1.txt']
batch_size = 3
num_epochs = 10
iterator = input_pipeline(filenames, batch_size)

# `a1`, `a2`, and `a3` represent the next element to be retrieved from the iterator.    
a1, a2, a3 = iterator.get_next()

with tf.Session() as sess:
    for _ in range(num_epochs):
        # Resets the iterator at the beginning of an epoch.
        sess.run(iterator.initializer)

        try:
            while True:
                a, b, c = sess.run([a1, a2, a3])
                print(a, b, c)
        except tf.errors.OutOfRangeError:
            # This will be raised when you reach the end of an epoch (i.e. the
            # iterator has no more elements).
            pass                 

        # Perform any end-of-epoch computation here.
        print('Done training, epoch reached')
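
A follow-up note beyond the original answer: tf.errors.OutOfRangeError is effectively the end-of-epoch flag asked about, and re-running iterator.initializer starts the next epoch. On TensorFlow 1.4 and later (an assumption; the answer targets 1.2, where this API still lived in contrib), the same pipeline can be written against the core tf.data namespace, which replaced tf.contrib.data. A sketch:

import tensorflow as tf

def input_pipeline(filenames, batch_size):
    # Same pipeline as above, ported to tf.data (TensorFlow 1.4+; this port
    # is my assumption, not part of the original answer).
    dataset = (tf.data.TextLineDataset(filenames)
               .map(lambda line: tf.decode_csv(
                    line, record_defaults=[['1'], ['1'], ['1']], field_delim='-'))
               .shuffle(buffer_size=10)
               .batch(batch_size))
    return dataset.make_initializable_iterator()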

