Transforming the data stored in tfrecord format to become inputs to a lstm Keras model in Tensorflow and fitting the model with that data

Question

I have a very long dataframe (25 million rows x 500 columns) which I can access as a csv file or a parquet file, but which I cannot load into the RAM of my PC.

The data should be shaped appropriately in order to become input to a Keras LSTM model (Tensorflow 2), given a desired number of timestamps per sample and a desired number of samples per batch.

This is my second post on this subject. I have already been advised to convert the data to tfrecord format.

Since my original environment will be PySpark, the way to do this transformation would be:

myDataFrame.write.format("tfrecords").option("writeLocality", "local").save("/path") 

(See: How to convert multiple parquet files to TFrecord files using SPARK?)

Assuming now that this has been done, and to simplify things and make them concrete and reproducible, let's assume a dataframe shaped 1000 rows x 3 columns, where the first two columns are features and the last one is the target, while each row corresponds to a timestamp.

For example, the first column is temperature, the second column is wind_speed and the third column (the target) is energy_consumption. Each row corresponds to an hour. The dataset contains observations of 1,000 consecutive hours. We assume that the energy consumption at any given hour is a function of the state of the atmosphere over several hours before. Therefore, we want to use an lstm model to estimate energy consumption. We have decided to feed the lstm model with samples, each of which contains the data from the previous 5 hours (i.e. 5 rows per sample). For simplicity, assume that the target has been shifted backwards one hour, so that a slice data[0:4, :-1] has as target data[3, -1]. Assume batch_size = 32.
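To make the intended shaping concrete, here is a small NumPy-only sketch (not part of the original question) of cutting the 1000 x 3 toy frame into overlapping 5-timestamp samples. It assumes one common convention: each window's last row carries the already shifted target.

```python
import numpy as np

# Toy stand-in for the 1000 x 3 dataframe described above: columns are
# temperature, wind_speed and the (already shifted) energy_consumption target.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3)).astype("float32")

window = 5  # timestamps per sample

# Overlapping windows: sample i covers rows i .. i+window-1.
n_samples = len(data) - window + 1                     # 996
x = np.stack([data[i:i + window, :-1] for i in range(n_samples)])
y = data[window - 1:, -1]                              # last row's target per window

print(x.shape)  # (996, 5, 2)
print(y.shape)  # (996,)
```

Batching these into groups of 32 then yields the (32, 5, 2) input shape the LSTM expects; the streaming question is how to get the same windows without materializing `data` in RAM.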

The data are stored on our hard disk in .tfrecords format. We cannot load all the data into our RAM.

How should we proceed?

Answer

I don't understand the question. This works out of the box with tfrecords:

import tensorflow as tf

# this will not load all data into RAM; records are streamed from disk
dataset = tf.data.TFRecordDataset("./path_to_tfrecord.tfrecord")
for sample in dataset:
  print(sample.numpy())

Training:

model.fit(dataset)
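Going beyond the answer above, here is one possible sketch (not from the original answer) of a full streaming pipeline: parse each serialized row, cut overlapping 5-row windows with `tf.data.Dataset.window`, and batch. The feature names and the fixed-length feature spec are assumptions based on the toy example in the question; a real file written by Spark may use different names and types.

```python
import os
import tempfile
import tensorflow as tf

WINDOW = 5   # timestamps per sample
BATCH = 32   # samples per batch

# Assumed schema: one float per column per row (per timestamp).
FEATURE_SPEC = {
    "temperature": tf.io.FixedLenFeature([], tf.float32),
    "wind_speed": tf.io.FixedLenFeature([], tf.float32),
    "energy_consumption": tf.io.FixedLenFeature([], tf.float32),
}

def parse_row(serialized):
    """Decode one serialized row into (features, shifted target)."""
    ex = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    return tf.stack([ex["temperature"], ex["wind_speed"]]), ex["energy_consumption"]

def make_dataset(path):
    ds = tf.data.TFRecordDataset(path).map(parse_row)
    # Slide an overlapping window of WINDOW consecutive rows over the stream.
    ds = ds.window(WINDOW, shift=1, drop_remainder=True)
    # Each window arrives as a pair of tiny datasets; collect them into tensors.
    ds = ds.flat_map(
        lambda f, t: tf.data.Dataset.zip((f.batch(WINDOW), t.batch(WINDOW))))
    # Keep the (WINDOW, 2) feature block and the target of the window's last row.
    ds = ds.map(lambda f, t: (f, t[-1]))
    return ds.batch(BATCH)

# Write a toy 20-row file so the sketch is runnable end to end.
path = os.path.join(tempfile.mkdtemp(), "toy.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for i in range(20):
        ex = tf.train.Example(features=tf.train.Features(feature={
            name: tf.train.Feature(float_list=tf.train.FloatList(value=[float(i * k)]))
            for name, k in [("temperature", 1), ("wind_speed", 2),
                            ("energy_consumption", 3)]
        }))
        writer.write(ex.SerializeToString())

features, targets = next(iter(make_dataset(path)))
print(features.shape)  # (16, 5, 2): 20 rows give 16 overlapping windows
print(targets.shape)   # (16,)
```

Since Keras accepts a `tf.data.Dataset` yielding (inputs, targets) batches, `model.fit(make_dataset(path))` can then consume the stream directly without loading the whole file into RAM.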

Can you give a few samples of what gets printed? (With "..."s to shorten stuff if necessary.)
