Keras ImageDataGenerator validation split not selected from shuffled dataset
Question
How can I randomly split my image dataset into training and validation datasets? More specifically, the validation_split argument of the Keras ImageDataGenerator does not split my images randomly into training and validation sets; instead it slices the validation samples from an unshuffled dataset.
Recommended answer
When you specify the validation_split argument in Keras' ImageDataGenerator, the split is performed before the data is shuffled, so only the last x samples are taken. The problem is that this last slice of the data may not be representative of the training data, so validation can fail. This is an especially common dead end when your image data is stored in a common directory with one sub-folder per class. This has been noted in several posts:
"As you mentioned, Keras simply takes the last x samples of the dataset, so if you want to keep using it, you need to shuffle your dataset in advance."
"Please check whether you have shuffled the data before training. Because the validation split in Keras is performed before the shuffle, you may have chosen an unbalanced dataset as your validation set, and so you got the low accuracy."
"The validation data is picked as the last 10% (for instance, if validation_split=0.1) of the input. The training data (the remainder) can optionally be shuffled at every epoch (shuffle argument in fit). That doesn't affect the validation data, obviously; it has to be the same set from epoch to epoch."
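To make the effect described in these posts concrete, here is a minimal sketch in plain Python. It does not call Keras: the file names and the `tail_split` helper are hypothetical and only mimic "hold out the last fraction of an ordered list", assuming the samples are listed class by class.

```python
import random
from collections import Counter

def tail_split(samples, validation_split):
    """Hypothetical helper: hold out the last fraction of the list, without shuffling."""
    n_val = int(len(samples) * validation_split)
    cut = len(samples) - n_val
    return samples[:cut], samples[cut:]

# Samples ordered by class, as they would be when read folder by folder.
samples = [("cat_%d.jpg" % i, "cat") for i in range(50)] + \
          [("dog_%d.jpg" % i, "dog") for i in range(50)]

# Slicing the ordered list: the validation set contains a single class.
_, val_unshuffled = tail_split(samples, 0.2)
unshuffled_counts = Counter(label for _, label in val_unshuffled)

# Shuffling first (fixed seed for reproducibility) draws from the whole list.
shuffled = samples[:]
random.Random(1337).shuffle(shuffled)
_, val_shuffled = tail_split(shuffled, 0.2)
shuffled_counts = Counter(label for _, label in val_shuffled)

print("unshuffled validation classes:", dict(unshuffled_counts))
print("shuffled validation classes:  ", dict(shuffled_counts))
```

In the unshuffled case every validation sample comes from the class that happens to be listed last, which is the kind of unrepresentative validation set the posts above describe.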
This answer points to sklearn's train_test_split() as a solution, but I want to propose a different solution that stays consistent with the Keras workflow.
With the split-folders package you can randomly split your main data directory into training, validation, and testing (or just training and validation) directories. The class-specific subfolders are copied automatically.
The input folder should have the following format:
input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...
The package then produces this:
output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/  # optional
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...
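For readers who want to see the idea behind such a split without installing anything, the following is a rough standard-library sketch. It is not the split-folders implementation: `split_class_folders` and its parameters are hypothetical names chosen for illustration, and it assumes the input layout shown above (one sub-folder per class).

```python
import random
import shutil
from pathlib import Path

def split_class_folders(input_dir, output_dir, val_ratio=0.2, seed=1337):
    """Hypothetical helper: shuffle each class folder, then copy it into train/ and val/."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    for class_dir in sorted(p for p in input_dir.iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)  # shuffle within the class BEFORE slicing
        n_val = int(len(files) * val_ratio)
        subsets = {"val": files[:n_val], "train": files[n_val:]}
        for subset, subset_files in subsets.items():
            target = output_dir / subset / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for f in subset_files:
                shutil.copy2(f, target / f.name)  # copy; originals are left untouched
```

Calling `split_class_folders("input", "output", val_ratio=0.2)` would create `output/train/<class>/` and `output/val/<class>/`. Because the shuffle happens per class, each class keeps roughly the same train/validation proportion; very small classes may end up with no validation files, since the count is rounded down.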
From the documentation:
import split_folders
# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`, i.e, `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values
# Split val/test with a fixed number of items e.g. 100 for each set.
# To only split into training and validation set, use a single number to `fixed`, i.e., `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values
With this new folder arrangement you can easily use Keras data generators to feed the training and validation sets and eventually train your model.
import os

import split_folders
import tensorflow as tf

main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'

# Randomly split the class folders into 70% training and 30% validation.
split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))

BATCH_SIZE = 32
IMG_SIZE = (224, 224)
IMG_SHAPE = IMG_SIZE + (3,)  # height, width, RGB channels

# Scale 8-bit pixel values from [0, 255] to [0, 1].
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    os.path.join(output_dir, 'train'),
    class_mode='categorical',
    batch_size=BATCH_SIZE,
    target_size=IMG_SIZE,
    shuffle=True)

# The same generator is reused because it only rescales (no augmentation).
validation_generator = train_datagen.flow_from_directory(
    os.path.join(output_dir, 'val'),
    class_mode='categorical',
    batch_size=BATCH_SIZE,
    target_size=IMG_SIZE,
    shuffle=False)  # order does not matter for validation metrics

base_model = tf.keras.applications.ResNet50V2(
    input_shape=IMG_SHAPE,
    include_top=False,
    weights=None)
maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
prediction_layer = tf.keras.layers.Dense(4, activation='softmax')  # 4 classes

model = tf.keras.Sequential([
    base_model,
    maxpool_layer,
    prediction_layer
])

opt = tf.keras.optimizers.Adam(learning_rate=0.004)
model.compile(optimizer=opt,
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // BATCH_SIZE,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // BATCH_SIZE,
    epochs=20)
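Before training, it can be worth confirming that every class actually appears in both splits. The following is a small standard-library sketch for that check; `count_images_per_class` and `report_split` are hypothetical helper names, and the code assumes the `train/` and `val/` layout shown earlier.

```python
from pathlib import Path

def count_images_per_class(split_dir):
    """Hypothetical helper: return {class_name: number_of_files} for one split directory."""
    return {
        class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
        for class_dir in sorted(Path(split_dir).iterdir())
        if class_dir.is_dir()
    }

def report_split(output_dir, subsets=("train", "val")):
    """Print and return per-class file counts for each subset found under output_dir."""
    counts = {}
    for subset in subsets:
        subset_dir = Path(output_dir) / subset
        if subset_dir.is_dir():
            counts[subset] = count_images_per_class(subset_dir)
            print(subset, counts[subset])
    return counts
```

Running `report_split(output_dir)` after the split prints one dictionary per subset; a class that is missing from `val`, or that has a count of zero, signals that the validation set is not representative.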