How to load batches of CSV files using tf.data and map

Question

I have been searching for an answer as to how I should go about this for quite some time and can't seem to find anything that works.

I am following a tutorial on using the tf.data API found here. My scenario is very similar to the one in this tutorial (i.e. I have 3 directories containing all the training/validation/test files), however, they are not images, they're spectrograms saved as CSVs.

I have found a couple solutions for reading lines of a CSV where each line is a training instance (e.g., How to *actually* read CSV data in TensorFlow?). But my issue with this implementation is the required record_defaults parameter as the CSVs are 500x200.
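
For reference, the row-oriented approach referred to there looks roughly like the sketch below (the file name and the 200-column count are illustrative assumptions). Each mapped element is a single row, not a whole 500x200 file, which is why record_defaults becomes awkward for this case:

import tensorflow as tf

# Hypothetical sketch of the line-per-record approach: one default per column.
record_defaults = [[0.0]] * 200                         # 200 columns -> 200 defaults
lines = tf.data.TextLineDataset(["spectrogram.csv"])    # illustrative file name
rows = lines.map(lambda line: tf.stack(tf.io.decode_csv(line, record_defaults)))
# Each element of 'rows' is one 200-value row, not a whole 500x200 spectrogram.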

Here is what I had in mind:

import tensorflow as tf
import pandas as pd

def load_data(path, label):
   # This obviously doesn't work because path and label
   # are Tensors, but this is what I had in mind...
   data = pd.read_csv(path, index_col=0).values
   return data, label

X_train = tf.constant(training_files)   # training_files is a list of the file names
Y_train = tf.constant(training_labels)  # training_labels is a list of labels for each file

train_data = tf.data.Dataset.from_tensor_slices((X_train, Y_train))

# Here is where I thought I would do the mapping of 'load_data' over each batch
train_data = train_data.batch(64).map(load_data)

iterator = tf.data.Iterator.from_structure(train_data.output_types, \
                                           train_data.output_shapes)
next_batch = iterator.get_next()
train_op = iterator.make_initializer(train_data)

I have only used TensorFlow's feed_dict in the past, but I need a different approach now that my data has grown too large to fit in memory.

Any ideas? Thanks.

Answer

I use TensorFlow (2.0) tf.data to read my CSV dataset. I have several folders, one for each class; each folder contains thousands of CSV files of data points. Below is the code I use for the data input pipeline. Hope this helps.

import numpy as np
import tensorflow as tf

def tf_parse_filename(filename):
    """Wraps a Python CSV loader with tf.py_function so it can run inside Dataset.map."""

    def parse_filename(filename_batch):
        data = []
        labels = []
        for filename in filename_batch:
            # Decode the filename tensor to a Python string
            filename_str = filename.numpy().decode()
            # Read the .csv file into a NumPy array
            data_point = np.loadtxt(filename_str, delimiter=',')

            # Create a one-hot label
            current_label = get_label(filename)
            label = np.zeros(n_classes, dtype=np.float32)
            label[current_label] = 1.0

            data.append(data_point)
            labels.append(label)

        return np.stack(data), np.stack(labels)

    x, y = tf.py_function(parse_filename, [filename], [tf.float32, tf.float32])
    return x, y

train_ds = tf.data.Dataset.from_tensor_slices(TRAIN_FILES)
train_ds = train_ds.batch(BATCH_SIZE, drop_remainder=True)
train_ds = train_ds.map(tf_parse_filename, num_parallel_calls=AUTOTUNE)
train_ds = train_ds.prefetch(buffer_size=AUTOTUNE)

# Train over epochs
for i in range(num_epochs):
    # Train on batches
    for x_train, y_train in train_ds:
        train_step(x_train, y_train)

print('Training done!')

TRAIN_FILES"是一个矩阵(例如熊猫数据框),其中第一列是数据点的标签,第二列是包含数据点的 csv 文件的路径.

"TRAIN_FILES" is a matrix (e.g. pandas dataframe) where the first column is the label of a data point and the second column is the path to the csv file containing the data point.
