Tensorflow Dataset using many compressed numpy files


Problem Description

I have a large dataset that I would like to use for training in Tensorflow.

The data is stored in compressed numpy format (using numpy.savez_compressed). There are variable numbers of images per file due to the way they are produced.

Currently I use a Keras Sequence based generator object to train, but I'd like to move entirely to Tensorflow without Keras.

I'm looking at the Dataset API on the TF website, but it is not obvious how I might use this to read numpy data.

My first idea was:

import glob
import tensorflow as tf
import numpy as np

def get_data_from_filename(filename):
    npdata = np.load(open(filename))
    return npdata['features'], npdata['labels']

# get files
filelist = glob.glob('*.npz')

# create dataset of filenames
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds.flat_map(get_data_from_filename)

However, this passes a TF Tensor placeholder to a real numpy function and numpy is expecting a standard string. This results in the error:

File "test.py", line 6, in get_data_from_filename
   npdata = np.load(open(filename))
TypeError: coercing to Unicode: need string or buffer, Tensor found

The other option I'm considering (but seems messy) is to create a Dataset object built on TF placeholders which I then fill during my epoch-batch loop from my numpy files.
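For reference, a minimal sketch of what that placeholder pattern would look like (TF 1.x; the placeholder shapes, batch size, and epoch count are illustrative assumptions, not part of the original setup):

import glob
import numpy as np
import tensorflow as tf

features_ph = tf.placeholder(tf.float32, shape=[None, 64, 64, 3])  # assumed image shape; adjust
labels_ph = tf.placeholder(tf.float32, shape=[None])

ds = tf.data.Dataset.from_tensor_slices((features_ph, labels_ph)).batch(32)
iterator = ds.make_initializable_iterator()
next_batch = iterator.get_next()

num_epochs = 10  # assumption
with tf.Session() as sess:
    for _ in range(num_epochs):
        for fname in glob.glob('*.npz'):
            arrays = np.load(fname)
            # Re-initialize the iterator with each file's contents.
            sess.run(iterator.initializer,
                     feed_dict={features_ph: arrays['features'],
                                labels_ph: arrays['labels']})
            while True:
                try:
                    sess.run(next_batch)  # training step would go here
                except tf.errors.OutOfRangeError:
                    break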

Any suggestions?

Recommended Answer

You can define a wrapper and use tf.py_func like this:

def get_data_from_filename(filename):
    npdata = np.load(filename)
    return npdata['features'], npdata['labels']

def get_data_wrapper(filename):
    # Assuming here that both your data and label are float type.
    features, labels = tf.py_func(
        get_data_from_filename, [filename], (tf.float32, tf.float32))
    return tf.data.Dataset.from_tensor_slices((features, labels))

# Create dataset of filenames.
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds = ds.flat_map(get_data_wrapper)
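From there you can batch and iterate as usual; a quick usage sketch (TF 1.x, the batch size is an arbitrary choice):

ds = ds.batch(32)
iterator = ds.make_one_shot_iterator()
features_batch, labels_batch = iterator.get_next()

with tf.Session() as sess:
    while True:
        try:
            f, l = sess.run([features_batch, labels_batch])
            # training step goes here
        except tf.errors.OutOfRangeError:
            break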

If your dataset is very large and you have memory issues, you can consider using a combination of interleave or parallel_interleave and the from_generator method instead. The from_generator method uses py_func internally, so you can read your npz files directly and then define your generator in Python.
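That route might look roughly like this. A sketch under assumptions: each file holds float32 'features'/'labels' arrays, the npz_generator/make_file_dataset helper names and cycle_length value are made up for illustration, filelist comes from the question's code, and parallel_interleave lives under tf.data.experimental (tf.contrib.data in older TF 1.x):

def npz_generator(filename):
    # from_generator passes string args through as bytes.
    npdata = np.load(filename.decode())
    for feature, label in zip(npdata['features'], npdata['labels']):
        yield feature, label

def make_file_dataset(filename):
    return tf.data.Dataset.from_generator(
        npz_generator,
        output_types=(tf.float32, tf.float32),
        args=(filename,))

ds = tf.data.Dataset.from_tensor_slices(filelist)
# Read several files concurrently, streaming examples instead of
# loading everything into memory at once.
ds = ds.apply(tf.data.experimental.parallel_interleave(
    make_file_dataset, cycle_length=4))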
