TensorFlow Dataset `from_generator` with variable batch size

Question

I'm trying to use the TensorFlow Dataset API to read an HDF5 file, using the from_generator method. Everything works fine unless the batch size does not evenly divide the number of events. I don't quite see how to make a flexible batch using the API.

If things don't divide evenly, you get errors like:

2018-08-31 13:47:34.274303: W tensorflow/core/framework/op_kernel.cc:1263] Invalid argument: ValueError: `generator` yielded an element of shape (1, 28, 28, 1) where an element of shape (11, 28, 28, 1) was expected.
Traceback (most recent call last):

  File "/Users/perdue/miniconda3/envs/py3a/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 206, in __call__
    ret = func(*args)

  File "/Users/perdue/miniconda3/envs/py3a/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 452, in generator_py_func
    "of shape %s was expected." % (ret_array.shape, expected_shape))

ValueError: `generator` yielded an element of shape (1, 28, 28, 1) where an element of shape (11, 28, 28, 1) was expected.

I have a script that reproduces the error (and instructions to get the required data file, Fashion MNIST, which is several MB) here:

https://gist.github.com/gnperdue/b905a9c2dd4c08b53e0539d6aa3d3dc6

The most important code is probably:

def make_fashion_dset(file_name, batch_size, shuffle=False):
    dgen = _make_fashion_generator_fn(file_name, batch_size)
    features_shape = [batch_size, 28, 28, 1]
    labels_shape = [batch_size, 10]
    ds = tf.data.Dataset.from_generator(
        dgen, (tf.float32, tf.uint8),
        (tf.TensorShape(features_shape), tf.TensorShape(labels_shape))
    )
    ...

where dgen is a generator function reading from the HDF5 file:

def _make_fashion_generator_fn(file_name, batch_size):
    reader = FashionHDF5Reader(file_name)
    nevents = reader.openf()

    def example_generator_fn():
        start_idx, stop_idx = 0, batch_size
        while True:
            if start_idx >= nevents:
                reader.closef()
                return
            yield reader.get_examples(start_idx, stop_idx)
            start_idx, stop_idx = start_idx + batch_size, stop_idx + batch_size

    return example_generator_fn

The core of the problem is that we have to declare the tensor shapes in from_generator, but we need the flexibility to change that shape later while iterating.
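To make the mismatch concrete, the index arithmetic of the generator can be replayed without any HDF5 file. This is only a sketch: `batch_bounds` is a hypothetical helper, and it assumes `get_examples` slices the way NumPy or h5py arrays do, so a stop index past the end is clamped to the number of events. That assumption is consistent with the shape `(1, 28, 28, 1)` reported in the error above.

```python
def batch_bounds(nevents, batch_size):
    """Yield (start, stop) index pairs the way example_generator_fn advances them.

    Assumption: the reader clamps stop to nevents, as array slicing would.
    """
    start = 0
    while start < nevents:
        yield start, min(start + batch_size, nevents)
        start += batch_size


# Illustrative numbers only: 23 events with a batch size of 11.
sizes = [stop - start for start, stop in batch_bounds(nevents=23, batch_size=11)]
print(sizes)  # [11, 11, 1]
```

The final batch holds a single example, so its leading dimension no longer matches a declared shape of `[batch_size, 28, 28, 1]`.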

There are some workarounds: drop the last few samples to get even division, or just use a batch size of 1. But the first is bad if you can't lose any samples, and a batch size of 1 is very slow.

Any ideas or comments? Thanks!

Recommended answer

When specifying Tensor shapes in from_generator, you can use None as an element to mark a dimension as variable-sized. This way you can accommodate batches of different sizes, in particular "leftover" batches that are a bit smaller than your requested batch size. So you would use:

def make_fashion_dset(file_name, batch_size, shuffle=False):
    dgen = _make_fashion_generator_fn(file_name, batch_size)
    features_shape = [None, 28, 28, 1]
    labels_shape = [None, 10]
    ds = tf.data.Dataset.from_generator(
        dgen, (tf.float32, tf.uint8),
        (tf.TensorShape(features_shape), tf.TensorShape(labels_shape))
    )
    ...
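For readers who want to try the idea without the HDF5 file, here is a minimal self-contained sketch. It is not the asker's script: `make_batch_generator` is a hypothetical stand-in that yields zero-filled arrays, and it assumes TensorFlow 2.4 or newer with eager execution, where `from_generator` accepts an `output_signature` of `tf.TensorSpec` objects in place of the separate types and shapes arguments used above.

```python
import numpy as np
import tensorflow as tf


def make_batch_generator(n_events, batch_size):
    """Return a generator function yielding (features, labels) batches.

    Hypothetical stand-in for the HDF5 reader: the arrays are all zeros,
    and the last batch is shorter when batch_size does not divide n_events.
    """
    def gen():
        for start in range(0, n_events, batch_size):
            n = min(batch_size, n_events - start)
            features = np.zeros((n, 28, 28, 1), dtype=np.float32)
            labels = np.zeros((n, 10), dtype=np.uint8)
            yield features, labels
    return gen


# None in the leading position leaves the batch dimension unconstrained.
ds = tf.data.Dataset.from_generator(
    make_batch_generator(n_events=23, batch_size=11),
    output_signature=(
        tf.TensorSpec(shape=(None, 28, 28, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(None, 10), dtype=tf.uint8),
    ),
)

batch_sizes = [int(features.shape[0]) for features, _ in ds]
print(batch_sizes)  # [11, 11, 1]
```

With the leading dimension declared as None, the short final batch passes the shape check instead of raising the ValueError shown in the question. Downstream code then has to read the batch size at run time (for example with `tf.shape`) rather than rely on a static value.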
