Feeding integer CSV data to a Keras Dense first layer in sequential model

Question

The documentation for CSV Datasets stops short of showing how to use a CSV dataset for anything practical like using the data to train a neural network. Can anyone provide a straightforward example to demonstrate how to do this, with clarity around data shape and type issues at a minimum, and preferably considering batching, shuffling, repeating over epochs as well?

For example, I have a CSV file of M rows, each row being an integer class label followed by N integers from which I hope to predict the class label using an old-style 3-layer neural network with H hidden neurons:

model = Sequential()
model.add(Dense(H, activation='relu', input_dim=N))
...
model.fit(train_ds, ...)

For my data, M > 50000 and N > 200. I have tried creating my dataset by using:

train_ds = tf.data.experimental.make_csv_dataset('mydata.csv', batch_size=B)

However... this leads to compatibility problems between the dataset and the model... but it's not clear where these compatibility problems lie - are they in the input shape, the integer (not float) data, or somewhere else?

Answer

This question may provide some help... although the answers mostly relate to Tensorflow V1.x.

It may be that CSV Datasets are not required for this task. The data size you indicate will probably fit in memory, and a tf.data.Dataset may wrap your data in more complexity than valuable functionality. You can do it without datasets (as shown below) so long as ALL your data is integers.

If you persist with the CSV Dataset approach, understand that there are many ways CSVs are used, and different approaches to loading them (e.g. see here and here). Because CSVs can have a variety of column types (numerical, boolean, text, categorical, ...), the first step is usually to load the CSV data in a column-oriented format. This provides access to the columns via their labels - useful for pre-processing. However, you probably want to provide rows of data to your model, so translating from columns to rows may be one source of confusion. At some point you will probably need to convert your integer data to float, but this may occur as a side-effect of certain pre-processing.
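To make the column-oriented idea concrete, here is a minimal numpy-only sketch (the CSV text and column names are invented for illustration): `genfromtxt` with `names=True` loads the file as a structured array whose columns are addressable by their header labels, and `structured_to_unstructured` converts the selected feature columns back into plain rows for the model.

```python
import io
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical 3-row CSV: a label column followed by two feature columns
csv_text = "label,f1,f2\n1,10,20\n0,30,40\n1,50,60\n"

# names=True loads the CSV column-oriented: a structured array whose
# fields are addressable by header label - handy for pre-processing
cols = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True, dtype=int)
labels = cols['label']                            # column access by label

# The model wants rows, so convert the structured (column) view
# of the feature fields back into an ordinary 2-D array of rows
features = rfn.structured_to_unstructured(cols[['f1', 'f2']])
print(labels)     # → [1 0 1]
print(features)   # → 3 rows x 2 columns
```

The same column-then-row dance is what trips people up with `make_csv_dataset`, which yields a dictionary of columns per batch rather than row tensors.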

So long as your CSVs contain integers only, without missing data, and with a header row, you can do it without a tf.data.Dataset, step-by-step as follows:

import numpy as np
from numpy import genfromtxt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

train_data = genfromtxt('train set.csv', delimiter=',')
test_data = genfromtxt('test set.csv', delimiter=',')
train_data = np.delete(train_data, (0), axis=0)    # delete header row
test_data = np.delete(test_data, (0), axis=0)      # delete header row
train_labels = train_data[:,[0]]
test_labels = test_data[:,[0]]
train_labels = tf.keras.utils.to_categorical(train_labels)
# count labels used in training set; categorise test set on same basis
# even if test set only uses a subset of the categories learnt in training
K = len(train_labels[0])
test_labels = tf.keras.utils.to_categorical(test_labels, K)
train_data = np.delete(train_data, (0), axis=1)    # delete label column
test_data = np.delete(test_data, (0), axis=1)      # delete label column
# Data will have been read in as float... but you may want scaling/normalization...
scale = lambda x: x/1000.0 - 500.0                 # change to suit
train_data = scale(train_data)                     # note: assign the result
test_data = scale(test_data)

N_train = len(train_data[0])        # columns in training set
N_test = len(test_data[0])          # columns in test set
if N_train != N_test:
  print("Datasets have incompatible column counts: %d vs %d" % (N_train, N_test))
  exit()
M_train = len(train_data)           # rows in training set
M_test = len(test_data)             # rows in test set

print("Training data size: %d rows x %d columns" % (M_train, N_train))
print("Test set data size: %d rows x %d columns" % (M_test, N_test))
print("Training to predict %d classes" % (K))

model = Sequential()
model.add(Dense(H, activation='relu', input_dim=N_train))     # H not yet defined...
...
model.compile(...)
model.fit( train_data, train_labels, ... )    # see docs for shuffle, batch, etc
model.evaluate( test_data, test_labels )
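On the question's remaining points (batching, shuffling, repeating over epochs): with in-memory arrays, `model.fit` already handles all three via its `batch_size`, `shuffle` and `epochs` arguments, so no `tf.data` pipeline is needed. For intuition about what that does under the hood, here is a hedged numpy sketch of the mechanics (the `batches` helper and toy data are invented for illustration, not part of any API):

```python
import numpy as np

rng = np.random.default_rng(0)

def batches(data, labels, batch_size, epochs):
    """Yield shuffled (data, labels) mini-batches over several epochs -
    a minimal stand-in for what model.fit(..., shuffle=True) does."""
    for _ in range(epochs):
        order = rng.permutation(len(data))         # reshuffle each epoch
        for start in range(0, len(data), batch_size):
            idx = order[start:start + batch_size]  # last batch may be short
            yield data[idx], labels[idx]

# Toy data: 10 rows of 3 features, binary labels
X = np.arange(30).reshape(10, 3)
y = np.arange(10) % 2
n = sum(1 for _ in batches(X, y, batch_size=4, epochs=2))
print(n)   # → 6 batches: ceil(10/4) = 3 per epoch, 2 epochs
```

If you later do want a `tf.data` pipeline, the equivalent is `Dataset.from_tensor_slices((train_data, train_labels)).shuffle(...).batch(...)` passed to `fit`.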
