How to load batches of CSV files using tf.data and map
Question
I have been searching for an answer as to how I should go about this for quite some time and can't seem to find anything that works.
I am following a tutorial on using the tf.data API found here. My scenario is very similar to the one in this tutorial (i.e., I have 3 directories containing all the training/validation/test files); however, my files are not images but spectrograms saved as CSVs.
I have found a couple of solutions for reading lines of a CSV where each line is a training instance (e.g., How to *actually* read CSV data in TensorFlow?). But my issue with this implementation is the required record_defaults parameter, as the CSVs are 500x200.
Here is what I had in mind:
import tensorflow as tf
import pandas as pd

def load_data(path, label):
    # This obviously doesn't work because path and label
    # are Tensors, but this is what I had in mind...
    data = pd.read_csv(path, index_col=0).values
    return data, label

X_train = tf.constant(training_files)   # training_files is a list of the file names
Y_train = tf.constant(training_labels)  # training_labels is a list of labels for each file

train_data = tf.data.Dataset.from_tensor_slices((X_train, Y_train))

# Here is where I thought I would do the mapping of 'load_data' over each batch
train_data = train_data.batch(64).map(load_data)

iterator = tf.data.Iterator.from_structure(train_data.output_types,
                                           train_data.output_shapes)
next_batch = iterator.get_next()
train_op = iterator.make_initializer(train_data)
I have only used TensorFlow's feed_dict in the past, but I need a different approach now that my data has grown too large to fit in memory.
Any thoughts? Thanks.
Recommended answer
I use TensorFlow (2.0) tf.data to read my CSV dataset. I have several folders, one for each class. Each folder contains thousands of CSV files of data points. Below is the code I use for the data input pipeline. Hope this helps.
import numpy as np
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# get_label, n_classes, TRAIN_FILES, BATCH_SIZE, num_epochs and train_step
# are assumed to be defined elsewhere; the original answer does not show them.

def tf_parse_filename(filename):
    def parse_filename(filename_batch):
        data = []
        labels = []
        for filename in filename_batch:
            # Decode the filename tensor to a Python string
            filename_str = filename.numpy().decode()
            # Read .csv file
            data_point = np.loadtxt(filename_str, delimiter=',', dtype=np.float32)
            # Create one-hot label
            current_label = get_label(filename)
            label = np.zeros(n_classes, dtype=np.float32)
            label[current_label] = 1.0
            data.append(data_point)
            labels.append(label)
        return np.stack(data), np.stack(labels)

    # py_function runs parse_filename eagerly so that .numpy() is available
    x, y = tf.py_function(parse_filename, [filename], [tf.float32, tf.float32])
    return x, y

train_ds = tf.data.Dataset.from_tensor_slices(TRAIN_FILES)
train_ds = train_ds.batch(BATCH_SIZE, drop_remainder=True)
train_ds = train_ds.map(tf_parse_filename, num_parallel_calls=AUTOTUNE)
train_ds = train_ds.prefetch(buffer_size=AUTOTUNE)

# Train on epochs
for i in range(num_epochs):
    # Train on batches
    for x_train, y_train in train_ds:
        train_step(x_train, y_train)
print('Training done!')
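One detail worth noting about the pipeline above: tensors returned by tf.py_function carry no static shape, which some layers and Keras APIs need. The sketch below is a possible variant, not the answer's own code. It maps over single files before batching, takes labels from a separate list instead of a get_label helper, and restores the shapes with set_shape. The names load_csv, tf_load_csv, N_ROWS, N_COLS and N_CLASSES are hypothetical, and the tiny random CSVs exist only so that the example runs on its own.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

N_ROWS, N_COLS = 4, 3   # assumed fixed CSV shape (the question's files are 500x200)
N_CLASSES = 2           # assumed number of classes


def load_csv(path, label):
    """Runs eagerly inside tf.py_function, so .numpy() is available."""
    data = np.loadtxt(path.numpy().decode(), delimiter=",", dtype=np.float32)
    return data, tf.one_hot(label, N_CLASSES)


def tf_load_csv(path, label):
    x, y = tf.py_function(load_csv, [path, label], [tf.float32, tf.float32])
    # py_function outputs have unknown static shape; restore it explicitly.
    x.set_shape([N_ROWS, N_COLS])
    y.set_shape([N_CLASSES])
    return x, y


# Write a few small random CSVs so the sketch runs end to end.
tmp_dir = tempfile.mkdtemp()
paths, labels = [], []
for i in range(6):
    p = os.path.join(tmp_dir, f"sample_{i}.csv")
    np.savetxt(p, np.random.rand(N_ROWS, N_COLS), delimiter=",")
    paths.append(p)
    labels.append(i % N_CLASSES)

ds = tf.data.Dataset.from_tensor_slices((paths, labels))
ds = ds.shuffle(len(paths))  # shuffling file names is cheap; do it before loading
ds = ds.map(tf_load_csv, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.batch(2).prefetch(tf.data.experimental.AUTOTUNE)

for x_batch, y_batch in ds:
    print(x_batch.shape, y_batch.shape)
```

Mapping before batching means each call loads one file, so the Python loop over a batch of filenames is no longer needed and the shapes can be declared per example.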
"TRAIN_FILES" is a matrix (e.g., a pandas DataFrame) where the first column is the label of a data point and the second column is the path to the CSV file containing the data point.
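The answer does not show how such a label/path table is built. As a rough sketch, assuming a layout of one sub-folder per class under a common root (root/<class_name>/*.csv), the rows could be collected with the standard library alone. The function name list_labelled_files and the folder names are hypothetical, and the temporary directory tree is created only for demonstration.

```python
import tempfile
from pathlib import Path


def list_labelled_files(root):
    """Return (rows, class_names); each row is (label_index, csv_path)."""
    root = Path(root)
    # Sort the class folders so label indices are stable between runs.
    class_names = sorted(d.name for d in root.iterdir() if d.is_dir())
    rows = []
    for label, name in enumerate(class_names):
        for csv_path in sorted((root / name).glob("*.csv")):
            rows.append((label, str(csv_path)))
    return rows, class_names


# Build a tiny throwaway directory tree to demonstrate.
root = Path(tempfile.mkdtemp())
for name in ("class_a", "class_b"):
    (root / name).mkdir()
    for i in range(2):
        (root / name / f"{i}.csv").write_text("0.0,1.0\n")

rows, class_names = list_labelled_files(root)
labels = [label for label, _ in rows]
paths = [path for _, path in rows]
print(class_names, labels)  # ['class_a', 'class_b'] [0, 0, 1, 1]
```

Keeping labels and paths as two separate lists makes it straightforward to pass them to tf.data.Dataset.from_tensor_slices as a tuple, which avoids mixing integers and strings in a single tensor.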