What's the difference between using Dataset and ndarray in fit method in Tensorflow 2?
Question
As a TF newbie, I feel a little confused about how to use BatchDataset when training a model.
Let's use MNIST as an example. In this classification task, we can load the data and feed the ndarrays x_train and y_train directly into the model.
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
The training results are:
Epoch 1/5
2021-02-17 15:43:02.621749: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
1/1875 [..............................] - ETA: 0s - loss: 2.2977 - accuracy: 0.0938WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0000s vs `on_train_batch_end` time: 0.0010s). Check your callbacks.
1875/1875 [==============================] - 2s 1ms/step - loss: 0.3047 - accuracy: 0.9117
Epoch 2/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1473 - accuracy: 0.9569
Epoch 3/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1097 - accuracy: 0.9673
Epoch 4/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0905 - accuracy: 0.9724
Epoch 5/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0759 - accuracy: 0.9764
We can also use tf.data.Dataset.from_tensor_slices to generate a BatchDataset and feed it into the fit function.
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

train_ds = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train)).shuffle(10000).batch(32)
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(32)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_ds, epochs=5)
The results of the training process are as follows.
Epoch 1/5
2021-02-17 15:30:34.698718: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
1875/1875 [==============================] - 3s 1ms/step - loss: 0.2969 - accuracy: 0.9140
Epoch 2/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.1462 - accuracy: 0.9566
Epoch 3/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.1087 - accuracy: 0.9669
Epoch 4/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0881 - accuracy: 0.9730
Epoch 5/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0765 - accuracy: 0.9759
The model can be trained successfully with both methods, but is there any difference between them? Does using a Dataset for training have any additional advantages? If there is no difference between the two methods in this case, what is the typical use case for generating a Dataset for training, and when should this method be used?
Thanks.
Answer
When we use Model.fit(x=None, y=None, ...), we can pass the training pairs as plain numpy arrays, a keras.utils.Sequence, or a tf.data dataset.
When we use it as follows, we pass each training pair (x and y) separately, as direct numpy arrays, to the fit function.
# data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
# fit
model.fit(x=x_train, y=y_train, ...
# check
print(x_train.shape, y_train.shape)
print(type(x_train), type(y_train))
# (60000, 28, 28) (60000,)
# <class 'numpy.ndarray'> <class 'numpy.ndarray'>
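Note that with plain numpy arrays, batching is handled inside fit via its batch_size argument (default 32), which is why both runs above show 60000 / 32 = 1875 steps per epoch. A minimal sketch:

# With numpy inputs, fit batches the data internally; batch_size
# defaults to 32 and array inputs are shuffled each epoch by default.
model.fit(x=x_train, y=y_train,
          batch_size=64,   # 60000 / 64 -> 938 steps per epoch
          shuffle=True,
          epochs=5)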
On the other hand, with tf.data and Sequence we pass the training pairs as a tuple, and the data type is still ndarray. According to the doc:
- A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights)
- A generator or keras.utils.Sequence returning (inputs, targets)
That is:
# data
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(10000).batch(2)
# check
next(iter(train_ds))
(<tf.Tensor: shape=(2, 28, 28), dtype=uint8, numpy= array([[[...], [[...]]], dtype=uint8)>,
<tf.Tensor: shape=(2,), dtype=uint8, numpy=array([7, 8], dtype=uint8)>)
And that's why, if x is a tf.data dataset, generator, or keras.utils.Sequence instance, y should not be specified (since targets will be obtained from x).
# fit
model.fit(train_ds, ...
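Incidentally, the test_ds built in the question can be passed the same way, for example as validation data (a sketch, assuming the model defined above):

# Both the training and evaluation sets can be Dataset objects;
# no separate y argument is needed for either.
model.fit(train_ds, validation_data=test_ds, epochs=5)
model.evaluate(test_ds)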
Among these three, a tf.data pipeline is the most efficient approach, followed by a generator. When the dataset is small enough, the first approach (x and y as numpy arrays) is usually chosen. But when the dataset gets big enough, you would turn to tf.data or a generator to build an efficient input pipeline. So the choice among these depends entirely on your case.
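As a rough sketch of such a pipeline (my own illustration, not from the answer above; tf.data.AUTOTUNE assumes TF 2.3+), cache and prefetch are the usual first optimizations:

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()

# cache() keeps the prepared elements in memory after the first epoch;
# prefetch() overlaps input preparation with training on the device.
train_ds = (tf.data.Dataset.from_tensor_slices((x_train / 255.0, y_train))
            .cache()
            .shuffle(10000)
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))

model.fit(train_ds, epochs=5)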
From the Keras post:
- NumPy arrays, just like Scikit-Learn and many other Python-based libraries. This is a good option if your data fits in memory.
- TensorFlow Dataset objects. This is a high-performance option that is more suitable for datasets that do not fit in memory and that are streamed from disk or from a distributed filesystem.
- Python generators that yield batches of data (such as custom subclasses of the keras.utils.Sequence class).