Feeding .npy (numpy files) into tensorflow data pipeline


Question

Tensorflow seems to lack a reader for ".npy" files. How can I read my data files into the new tensorflow.data.Dataset pipeline? My data doesn't fit in memory.

Each object is saved in a separate ".npy" file. Each file contains two different ndarrays as features and a scalar as their label.

Answer

Does your data fit into memory? If so, you can follow the instructions from the Consuming NumPy arrays section of the docs:

Consuming NumPy arrays

If all of your input data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects and use Dataset.from_tensor_slices().

import numpy as np
import tensorflow as tf

# Load the training data into two NumPy arrays, for example using `np.load()`.
# Note: dictionary-style access and the context manager require an ".npz"
# archive; `np.load()` on a plain ".npy" file returns a single array.
with np.load("/var/data/training_data.npz") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

In the case that the data doesn't fit into memory, it seems that the only recommended approach is to first convert the npy data into the TFRecord format, and then use a TFRecordDataset, which can be streamed from disk without loading everything into memory.
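
A minimal sketch of that conversion, assuming one ".npz" archive per object with hypothetical keys "feature_a", "feature_b", and "label" and float32 features (adapt the loading code and dtypes to your own layout; the tf.io.* names are the TF >= 1.14 spellings, older 1.x uses tf.python_io.TFRecordWriter and tf.parse_single_example):

import numpy as np
import tensorflow as tf

def _bytes_feature(value):
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def convert_to_tfrecord(npz_paths, out_path):
  # Serialize each object's arrays into one tf.train.Example per record.
  with tf.io.TFRecordWriter(out_path) as writer:
    for path in npz_paths:
      with np.load(path) as data:
        example = tf.train.Example(features=tf.train.Features(feature={
            "feature_a": _bytes_feature(data["feature_a"].astype(np.float32).tobytes()),
            "feature_b": _bytes_feature(data["feature_b"].astype(np.float32).tobytes()),
            "label": _int64_feature(int(data["label"])),
        }))
        writer.write(example.SerializeToString())

def parse_example(serialized):
  parsed = tf.io.parse_single_example(serialized, {
      "feature_a": tf.io.FixedLenFeature([], tf.string),
      "feature_b": tf.io.FixedLenFeature([], tf.string),
      "label": tf.io.FixedLenFeature([], tf.int64),
  })
  # decode_raw yields flat tensors; reshape here if the array shapes are known.
  a = tf.io.decode_raw(parsed["feature_a"], tf.float32)
  b = tf.io.decode_raw(parsed["feature_b"], tf.float32)
  return (a, b), parsed["label"]

# Records are streamed from disk, never loaded all at once.
dataset = tf.data.TFRecordDataset(["/var/data/training.tfrecord"])
dataset = dataset.map(parse_example).batch(32)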

Here is a post with some explanation.

FWIW, it seems crazy to me that a TFRecordDataset cannot be instantiated directly from a directory name or the file name(s) of npy files, but this appears to be a limitation of plain Tensorflow.

If you can split the single large npy file into smaller files that each roughly represent one batch for training, then you could write a custom data generator in Keras that yields only the data needed for the current batch.
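
A minimal sketch of such a generator as a keras.utils.Sequence, assuming one hypothetical per-batch ".npz" file holding "features" and "labels" arrays:

import numpy as np
from tensorflow import keras

class NpyBatchSequence(keras.utils.Sequence):
  """Yields one pre-split batch file per training step."""

  def __init__(self, batch_paths):
    self.batch_paths = batch_paths

  def __len__(self):
    # One file corresponds to one batch.
    return len(self.batch_paths)

  def __getitem__(self, idx):
    # Only the current batch is ever held in memory.
    with np.load(self.batch_paths[idx]) as data:
      return data["features"], data["labels"]

You can then pass an instance to model.fit() (or model.fit_generator() on older Keras versions), and it will load one batch file at a time.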

In general, if your dataset cannot fit in memory, storing it as one single large npy file makes it very hard to work with; preferably you should first reformat the data, either as TFRecord or as multiple smaller npy files, and then use the methods above.
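
One way to do that splitting without loading the big file, assuming separate (hypothetical) features.npy and labels.npy files, is numpy's memory mapping:

import numpy as np

# Memory-map the oversized arrays so slices are read from disk on demand.
features = np.load("/var/data/features.npy", mmap_mode="r")
labels = np.load("/var/data/labels.npy", mmap_mode="r")

batch_size = 256
for i, start in enumerate(range(0, features.shape[0], batch_size)):
  stop = start + batch_size
  # np.asarray materializes only the current slice in memory.
  np.savez("/var/data/batches/batch_%05d.npz" % i,
           features=np.asarray(features[start:stop]),
           labels=np.asarray(labels[start:stop]))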
