What's the difference between using Dataset and ndarray in the fit method in TensorFlow 2?

Question

As a newcomer to TF, I'm a little confused about how BatchDataset is used when training a model.

Let's use MNIST as an example. In this classification task, we can load the data and feed the x_train and y_train ndarrays directly into the model.

import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train,y_train, epochs=5)

The training results are:

Epoch 1/5
2021-02-17 15:43:02.621749: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
   1/1875 [..............................] - ETA: 0s - loss: 2.2977 - accuracy: 0.0938WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0000s vs `on_train_batch_end` time: 0.0010s). Check your callbacks.
1875/1875 [==============================] - 2s 1ms/step - loss: 0.3047 - accuracy: 0.9117
Epoch 2/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1473 - accuracy: 0.9569
Epoch 3/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1097 - accuracy: 0.9673
Epoch 4/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0905 - accuracy: 0.9724
Epoch 5/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0759 - accuracy: 0.9764

We can also use tf.data.Dataset.from_tensor_slices to build a BatchDataset and feed it to the fit function.

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

train_ds = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train)).shuffle(10000).batch(32)

test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(32)

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_ds, epochs=5)

The results of the training process are as follows.

Epoch 1/5
2021-02-17 15:30:34.698718: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
1875/1875 [==============================] - 3s 1ms/step - loss: 0.2969 - accuracy: 0.9140
Epoch 2/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.1462 - accuracy: 0.9566
Epoch 3/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.1087 - accuracy: 0.9669
Epoch 4/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0881 - accuracy: 0.9730
Epoch 5/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0765 - accuracy: 0.9759

The model trains successfully with both methods, but is there any difference between them? Does using a Dataset for training have additional advantages? If there is no difference between the two methods in this case, what is the typical use case for building a Dataset for training, and when should this method be used?

Thanks.

Recommended answer

When we use Model.fit(x=None, y=None, ...), we can pass the training pairs as plain NumPy arrays, as a keras.utils.Sequence, or as a tf.data dataset.

When we call it as follows, we pass each half of the training pair (x and y) separately to the fit function as a NumPy array.

# data 
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()

# fit
model.fit(x = x_train, y = y_train, ... 

# check
print(x_train.shape, y_train.shape)
print(type(x_train), type(y_train))

# (60000, 28, 28) (60000,)
# <class 'numpy.ndarray'> <class 'numpy.ndarray'>
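
With array inputs, fit does the batching itself. Its batch_size argument defaults to 32 when left unspecified, which is why the ndarray run in the question reports the same 1875 steps per epoch as the Dataset built with .batch(32). A small sketch of that arithmetic (the variable names are only illustrative):

```python
import math

num_samples = 60000   # size of the MNIST training set, as printed above
batch_size = 32       # Model.fit's default batch_size for array inputs

# Number of batches (steps) per epoch; the last batch may be smaller.
steps_per_epoch = math.ceil(num_samples / batch_size)
print(steps_per_epoch)  # 1875
```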

On the other hand, with tf.data and Sequence we pass the training pairs together as a single tuple. A tf.data dataset yields them as tf.Tensor objects (see the check below), while a Sequence typically returns NumPy arrays. According to the doc:

  • A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
  • A generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample_weights).

# data
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(10000).batch(2)

# check
next(iter(train_ds))

(<tf.Tensor: shape=(2, 28, 28), dtype=uint8, numpy= array([[[...], [[...]]], dtype=uint8)>,
 <tf.Tensor: shape=(2,), dtype=uint8, numpy=array([7, 8], dtype=uint8)>)

That's why, if x is a tf.data dataset, a generator, or a keras.utils.Sequence instance, y should not be specified (the targets are obtained from x).

# fit 
model.fit(train_ds, ...
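
The same (inputs, targets) contract applies to a plain Python generator. Below is a minimal sketch, assuming small synthetic arrays in place of MNIST; batch_generator, x_demo and y_demo are hypothetical names introduced only for this illustration.

```python
import numpy as np

def batch_generator(x, y, batch_size):
    """Yield (inputs, targets) batches forever, reshuffling on every pass."""
    n = len(x)
    while True:
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield x[idx], y[idx]

# Synthetic stand-in for MNIST: 100 blank 28x28 images with label 0.
x_demo = np.zeros((100, 28, 28), dtype="float32")
y_demo = np.zeros((100,), dtype="int64")

gen = batch_generator(x_demo, y_demo, batch_size=32)
inputs, targets = next(gen)
print(inputs.shape, targets.shape)  # (32, 28, 28) (32,)
```

Because this generator never stops, fit would need to be told how many batches make up an epoch, for example model.fit(gen, steps_per_epoch=4, epochs=5), again without a y argument.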

Among these three, a tf.data pipeline is the most efficient approach, followed by a generator. When the dataset is small enough to fit in memory, the first approach (separate x and y arrays) is the usual choice. When the dataset gets large, consider tf.data or a generator for an efficient input pipeline. So the choice depends on the size of your data.
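
As a rough illustration of what an efficient tf.data input pipeline can look like, here is a sketch that normalizes, caches, shuffles, batches and prefetches. It assumes TensorFlow 2.4 or later (for tf.data.AUTOTUNE) and uses synthetic arrays rather than MNIST; normalize, x_demo and y_demo are hypothetical names, and the stages chosen are one reasonable arrangement rather than the only one.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for MNIST so the sketch runs without a download.
x_demo = np.zeros((100, 28, 28), dtype="uint8")
y_demo = np.zeros((100,), dtype="uint8")

def normalize(image, label):
    # Scale pixel values to [0, 1] inside the pipeline.
    return tf.cast(image, tf.float32) / 255.0, label

train_ds = (
    tf.data.Dataset.from_tensor_slices((x_demo, y_demo))
    .map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                      # keep the preprocessed elements in memory
    .shuffle(100)                 # buffer covers the whole toy dataset
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)   # prepare next batches while the model trains
)

images, labels = next(iter(train_ds))
print(images.shape, images.dtype, labels.shape)
```

The resulting train_ds can be passed to model.fit(train_ds, epochs=...) exactly like the dataset in the question.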

From the Keras post:

  • NumPy arrays, just like Scikit-Learn and many other Python-based libraries. This is a good option if your data fits in memory.
  • TensorFlow Dataset objects. This is a high-performance option that is more suitable for datasets that do not fit in memory and that are streamed from disk or from a distributed filesystem.
  • Python generators that yield batches of data (such as custom subclasses of the keras.utils.Sequence class).
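
For the third option, a custom keras.utils.Sequence subclass only needs __len__ (batches per epoch) and __getitem__ (one batch by index). A minimal sketch, assuming in-memory synthetic arrays; ArraySequence, x_demo and y_demo are hypothetical names, and a real subclass would more likely load each batch from disk.

```python
import math
import numpy as np
import tensorflow as tf

class ArraySequence(tf.keras.utils.Sequence):
    """Serve (inputs, targets) batches from two in-memory arrays."""

    def __init__(self, x, y, batch_size):
        super().__init__()
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # Number of batches per epoch.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, index):
        # Return the batch at position `index` as (inputs, targets).
        start = index * self.batch_size
        end = start + self.batch_size
        return self.x[start:end], self.y[start:end]

x_demo = np.zeros((100, 28, 28), dtype="float32")
y_demo = np.zeros((100,), dtype="int64")

seq = ArraySequence(x_demo, y_demo, batch_size=32)
print(len(seq), seq[0][0].shape, seq[3][0].shape)  # 4 (32, 28, 28) (4, 28, 28)
```

Such an object is passed as model.fit(seq, epochs=...), with no separate y.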
