Tensorflow-IO Dataset input pipeline with very large HDF5 files

This article describes how to handle a Tensorflow-IO Dataset input pipeline with very large HDF5 files; it may be a useful reference for anyone facing the same problem.

Problem description


I have very large training files (30 GB).
Since the data does not all fit in my available RAM, I want to read it in batches.
I saw that the Tensorflow-io package implements a way to read HDF5 into Tensorflow through the function tfio.IODataset.from_hdf5().
Then, since tf.keras.Model.fit() takes a tf.data.Dataset as input containing both samples and targets, I need to zip my X and Y together and then use .batch and .prefetch so that only the necessary data is loaded into memory. For testing, I tried applying this method to smaller samples: training (9 GB), validation (2.5 GB) and testing (1.2 GB), which I know work well because they fit into memory and give good results (70% accuracy and <1 loss).
The data is stored in HDF5 files, split into sample (X) and label (Y) files, like so:

X_learn.hdf5  
X_val.hdf5  
X_test.hdf5  
Y_test.hdf5  
Y_learn.hdf5  
Y_val.hdf5
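
The dataset names passed to tfio.IODataset.from_hdf5() in the code below ('/X_learn', '/Y_val', ...) must match the keys stored inside each file. If they are unknown, they can be listed with h5py first; a minimal sketch, assuming h5py is installed and using one of the file names above:

import h5py

# Print every dataset key in the file along with its shape and dtype,
# e.g. to confirm that the samples really live under '/X_learn'.
with h5py.File('X_learn.hdf5', 'r') as f:
    for name, dset in f.items():
        print(name, dset.shape, dset.dtype)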

Here is my code:

import tensorflow as tf
import tensorflow_io as tfio
from tensorflow.keras.optimizers import Adam

BATCH_SIZE = 2048
EPOCHS = 100

# Create an IODataset from an HDF5 file's dataset object
x_val = tfio.IODataset.from_hdf5(path_hdf5_x_val, dataset='/X_val')
y_val = tfio.IODataset.from_hdf5(path_hdf5_y_val, dataset='/Y_val')
x_test = tfio.IODataset.from_hdf5(path_hdf5_x_test, dataset='/X_test')
y_test = tfio.IODataset.from_hdf5(path_hdf5_y_test, dataset='/Y_test')
x_train = tfio.IODataset.from_hdf5(path_hdf5_x_train, dataset='/X_learn')
y_train = tfio.IODataset.from_hdf5(path_hdf5_y_train, dataset='/Y_learn')
 
# Zip together samples and corresponding labels
train = tf.data.Dataset.zip((x_train,y_train)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)
test = tf.data.Dataset.zip((x_test,y_test)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)
val = tf.data.Dataset.zip((x_val,y_val)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)

# Build the model
model = build_model()
 
# Compile the model with a custom learning rate schedule for the Adam optimizer
model.compile(loss='categorical_crossentropy',
               optimizer=Adam(lr=lr_schedule(0)),
               metrics=['accuracy'])

# Fit model with class_weights calculated before
model.fit(train,
          epochs=EPOCHS,
          class_weight=class_weights_train,
          validation_data=val,
          shuffle=True,
          callbacks=callbacks)
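
The snippet above relies on several objects defined elsewhere in the original script (the path_hdf5_* variables, build_model(), lr_schedule, class_weights_train and callbacks). They are not part of the question; the stand-ins below are purely hypothetical placeholders, shown only so the pipeline can be run end to end for testing:

# Hypothetical placeholders -- the real paths, model, schedule and weights are the asker's own.
path_hdf5_x_train, path_hdf5_y_train = 'X_learn.hdf5', 'Y_learn.hdf5'
path_hdf5_x_val,   path_hdf5_y_val   = 'X_val.hdf5',   'Y_val.hdf5'
path_hdf5_x_test,  path_hdf5_y_test  = 'X_test.hdf5',  'Y_test.hdf5'

def lr_schedule(epoch):
    # Simple step decay as a stand-in for the custom learning-rate function.
    return 1e-3 * (0.1 ** (epoch // 30))

def build_model():
    # Stand-in architecture; the real model is defined elsewhere.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

class_weights_train = None  # or a dict mapping class index to weight
callbacks = [tf.keras.callbacks.LearningRateScheduler(lr_schedule)]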

This code runs, but the loss goes very high (300+) and the accuracy drops to 0 (0.30 -> 4e-5) right from the beginning... I don't understand what I am doing wrong; am I missing something?

Solution

Providing the solution here (Answer Section), even though it is present in the Comment Section, for the benefit of the community.

There was no issue with the code; the problem was with the data, which had not been preprocessed properly. As a result, the model could not learn well, which led to the strange loss and accuracy.
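
As an illustration of the kind of preprocessing step the answer refers to, one common pattern is to scale the features and encode the labels inside the tf.data pipeline with .map before batching. The sketch below is hypothetical: the scaling statistics, the number of classes, and whether the labels need one-hot encoding all depend on the actual data and are not taken from the question.

# Hypothetical preprocessing applied inside the pipeline.
NUM_CLASSES = 10                        # assumption: number of label classes
FEATURE_MEAN, FEATURE_STD = 0.0, 1.0    # assumption: statistics computed on the training set

def preprocess(x, y):
    # Standardize the features and, if the labels are stored as integer class
    # indices, one-hot encode them to match categorical_crossentropy.
    x = (tf.cast(x, tf.float32) - FEATURE_MEAN) / FEATURE_STD
    y = tf.one_hot(tf.cast(y, tf.int32), NUM_CLASSES)
    return x, y

train = (tf.data.Dataset.zip((x_train, y_train))
         .map(preprocess)
         .batch(BATCH_SIZE, drop_remainder=True)
         .prefetch(tf.data.experimental.AUTOTUNE))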

