Tensorflow input pipeline where multiple rows correspond to a single observation?

Question

So I've just started using Tensorflow, and I'm struggling to properly understand input pipelines.

The problem I'm working on is sequence classification. I'm trying to read in a CSV file with shape (100000, 4). The first 3 columns are features, and the 4th column is the label. BUT - the data represents sequences of length 10, i.e. rows 1-10 are sequence 1, rows 11-20 are sequence 2, etc. This also means each label is repeated 10 times.

So at some point in this input pipeline, I'll need to reshape my feature tensor, like tf.reshape(features, [batch_size_, rows_per_ob, input_dim]), and take only every 10th row of my label tensor, like label[::rows_per_ob].
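
To make that concrete, here is roughly what I have in mind (an illustrative sketch only; features and label stand for the flat per-row tensors coming out of the pipeline, and input_dim would be 3 in my case):

# Illustrative only: `features` is [batch_size_ * rows_per_ob, input_dim]
# and `label` is [batch_size_ * rows_per_ob] after batching flat rows.
input_dim = 3

# Group every `rows_per_ob` consecutive rows into one sequence.
seq_features = tf.reshape(features, [batch_size_, rows_per_ob, input_dim])

# Each label is repeated `rows_per_ob` times, so keep one per sequence.
seq_label = label[::rows_per_ob]  # shape: [batch_size_]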

Another thing I should point out is that my actual dataset is in the billions of rows, so I have to think about performance.

I've put together the code below from the documentation and other posts on here, but I don't think I fully understand it, because I'm seeing the following error:

INFO:tensorflow:Error reported to Coordinator: , Attempting to use uninitialized value input_producer_2/limit_epochs/epochs

There seems to be an out of range error.

I also can't figure out what to do with these batches once I get them working. Initially, I thought I would reshape them and then feed them in via feed_dict, but then I read that this is really bad and that I should be using a tf.data.Dataset object. But I'm not sure how to feed these batches into a Dataset. I'm also not entirely sure when the optimal time in this process to reshape my data would be.

And a final point of confusion - when you use an Iterator with a Dataset object, I see that we use the get_next() method. Does this mean that each element in the Dataset represents a full batch of data? And does this then mean that if we want to change the batch size, we need to rebuild the entire Dataset object?

I'm really struggling to fit all the pieces together. If anyone has any pointers for me, it would be very much appreciated! Thanks!

# import
import tensorflow as tf

# constants
filename = "tensorflow_test_data.csv"
num_rows = 100000
rows_per_ob = 10
batch_size_ = 5
num_epochs_ = 2
num_batches = int(num_rows * num_epochs_ / batch_size_ / rows_per_ob)

# read csv line
def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=1)
    _, value = reader.read(filename_queue)
    record_defaults = [[0.0], [0.0], [0.0], [0.0]]
    a, b, c, d = tf.decode_csv(value, record_defaults=record_defaults)
    features = tf.stack([a, b, c])
    return features, d

def input_pipeline(filename=filename, batch_size=batch_size_, num_epochs=num_epochs_):
    filename_queue = tf.train.string_input_producer([filename],
                                                    num_epochs=num_epochs,
                                                    shuffle=False)
    x, y = read_from_csv(filename_queue)
    x_batch, y_batch = tf.train.batch([x, y],
                                      batch_size=batch_size * rows_per_ob,
                                      num_threads=1,
                                      capacity=10000)
    return x_batch, y_batch

###
x, y = input_pipeline(filename, batch_size=batch_size_,
                      num_epochs=num_epochs_)

# I imagine using lists is wrong here - this was more just for me to
# see the output
x_list = []
y_list = []
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for _ in range(num_batches):
        x_batch, y_batch = sess.run([x, y])
        x_list.append(x_batch)
        y_list.append(y_batch)
    coord.request_stop()
    coord.join(threads)

Answer

As an aside, the "Attempting to use uninitialized value input_producer_2/limit_epochs/epochs" error usually means that tf.local_variables_initializer() was never run: passing num_epochs to tf.train.string_input_producer creates a local epoch counter that must be initialized before the queue runners start. That said, you can express the entire pipeline using tf.data.Dataset objects, which might make things slightly easier:

dataset = tf.data.TextLineDataset(filename)

# Skip the header line.
dataset = dataset.skip(1)

# Combine 10 lines into a single observation.
dataset = dataset.batch(rows_per_ob)

def parse_observation(line_batch):
  record_defaults = [[0.0], [0.0], [0.0], [0.0]]
  # `line_batch` is a vector of `rows_per_ob` CSV line strings;
  # `tf.decode_csv` decodes them elementwise, so a, b, c, and d are each
  # vectors of length `rows_per_ob`.
  a, b, c, d = tf.decode_csv(line_batch, record_defaults=record_defaults)
  features = tf.stack([a, b, c], axis=1)  # shape: [rows_per_ob, 3]
  label = d[-1]  # The label is repeated on every row; take it from the last.
  return features, label

# Parse each observation into a `rows_per_ob x 3` matrix of features and a
# scalar label.
dataset = dataset.map(parse_observation)

# Batch multiple observations.
dataset = dataset.batch(batch_size)

# Optionally add a prefetch for performance.
dataset = dataset.prefetch(1)
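
Given the billions of rows mentioned in the question, the reading and parsing can also be parallelized. This is a sketch only, assuming (hypothetically) that the data were split into multiple CSV shards matching the pattern below, with each shard containing whole 10-row observations:

# Hypothetical shard pattern; each shard must hold complete observations.
files = tf.data.Dataset.list_files("tensorflow_test_data-*.csv")

# Read several shards concurrently. Rows are grouped into observations
# per file, so observations never mix lines from different shards.
dataset = files.interleave(
    lambda f: tf.data.TextLineDataset(f).skip(1).batch(rows_per_ob),
    cycle_length=4)

# Parse observations on multiple threads.
dataset = dataset.map(parse_observation, num_parallel_calls=4)

dataset = dataset.batch(batch_size_)
dataset = dataset.prefetch(1)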

To use the values from the dataset, you can make a tf.data.Iterator to get the next element as a pair of tf.Tensor objects, then use these as the input to your model.

iterator = dataset.make_one_shot_iterator()

features_batch, label_batch = iterator.get_next()

# Use the `features_batch` and `label_batch` tensors as the inputs to
# the model, rather than fetching them and feeding them via the `Session`
# interface.
train_op = build_model(features_batch, label_batch)
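
On your last two questions: each iterator.get_next() call does yield one full batch here (a [batch_size, rows_per_ob, 3] features tensor and a [batch_size] label vector), because the batching happens inside the dataset. And you don't have to rebuild the Dataset to change the batch size. One option, sketched below for TF 1.x and reusing the hypothetical build_model from above, is to make the batch size a placeholder and use an initializable iterator:

batch_size = tf.placeholder(tf.int64, shape=[])

dataset = tf.data.TextLineDataset(filename).skip(1)
dataset = dataset.batch(rows_per_ob)
dataset = dataset.map(parse_observation)
dataset = dataset.batch(batch_size)  # batch size is now a runtime value
dataset = dataset.prefetch(1)

iterator = dataset.make_initializable_iterator()
features_batch, label_batch = iterator.get_next()
train_op = build_model(features_batch, label_batch)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  # Re-initializing the iterator restarts the pipeline; a different
  # batch size can be fed each time without rebuilding the graph.
  sess.run(iterator.initializer, feed_dict={batch_size: 5})
  while True:
    try:
      sess.run(train_op)
    except tf.errors.OutOfRangeError:
      break  # reached the end of the file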
