Keras ImageDataGenerator validation split not selected from shuffled dataset


Problem Description

How can I randomly split my image dataset into training and validation datasets? More specifically, the validation_split argument in Keras' ImageDataGenerator is not randomly splitting my images into training and validation but is slicing the validation sample from an unshuffled dataset.

Recommended Answer

When you specify the validation_split argument in Keras' ImageDataGenerator, the split is performed before the data is shuffled, so only the last x samples are taken. The issue is that this last slice of data selected for validation may not be representative of the training data, and so training can fail. This is an especially common dead end when your image data is stored in a common directory with each sub-folder named by class. It has been noted in several posts, quoted below; a minimal sketch of the problematic usage follows the quotes:

Choosing a random validation dataset

As you mentioned, Keras simply takes the last x samples of the dataset, so if you want to keep using it, you need to shuffle your dataset in advance.

Training accuracy is very high, while validation accuracy is very low?

please check if you have shuffled the data before training. Because the validation split in keras is performed before the shuffle, you may have chosen an unbalanced dataset as your validation set, and thus you got the low accuracy.

验证拆分"是否随机选择验证样本?

The validation data is picked as the last 10% of the input (for instance, if validation_split=0.1). The training data (the remainder) can optionally be shuffled at every epoch (the shuffle argument in fit). That doesn't affect the validation data, which obviously has to be the same set from epoch to epoch.
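To make the behaviour described in these posts concrete, here is a minimal sketch of the usage that runs into it (the data/ directory name, the split fraction, and the image size are placeholder assumptions). When validation_split is combined with the subset argument of flow_from_directory, the validation subset is a deterministic slice of each class folder's sorted file list rather than a random sample:

import tensorflow as tf

# One generator, one split fraction shared by both subsets.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2)  # reserves a fixed slice of each class, not a random 20%

# Both calls read the same directory; "validation" is simply a fixed slice
# of each class folder's sorted file list.
train_gen = datagen.flow_from_directory('data/', subset='training',
                                        target_size=(224, 224),
                                        class_mode='categorical')
val_gen = datagen.flow_from_directory('data/', subset='validation',
                                      target_size=(224, 224),
                                      class_mode='categorical')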

This answer points to sklearn's train_test_split() as a solution, but I want to propose a different solution that keeps consistency in the keras workflow.
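For completeness, that sklearn route would look roughly like the sketch below (the input/ directory, the *.jpg pattern, and the pandas detour are assumptions of this sketch); the resulting dataframes could then be handed to ImageDataGenerator.flow_from_dataframe:

import glob
import os
import pandas as pd
from sklearn.model_selection import train_test_split

# Collect (filepath, label) pairs from the class-per-subfolder layout.
records = [(path, class_name)
           for class_name in os.listdir('input')
           for path in glob.glob(os.path.join('input', class_name, '*.jpg'))]
df = pd.DataFrame(records, columns=['filename', 'class'])

# Random, stratified 80/20 split of the file list.
train_df, val_df = train_test_split(df, test_size=0.2,
                                    stratify=df['class'], random_state=1337)

The split-folders approach below instead performs the split on disk, so the standard flow_from_directory workflow applies unchanged.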

With the split-folders package you can randomly split your main data directory into training, validation, and testing (or just training and validation) directories. The class-specific sub-folders are copied automatically.

The input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

in order to give you this:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/            # optional
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...

From the documentation:

import split_folders

# Split with a ratio.
# To split into training and validation sets only, pass a tuple to `ratio`, i.e. `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values

# Split val/test with a fixed number of items e.g. 100 for each set.
# To split into training and validation sets only, pass a single number to `fixed`, i.e. `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values

With this new folder arrangement you can easily use keras data generators to divide your data into training and validation sets and eventually train your model.

import tensorflow as tf
import split_folders
import os

main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'

# Randomly split main_dir into output/train and output/val with a 70/30 ratio.
split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))

# Rescale pixel values from [0, 255] to [0, 1].
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255)

train_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'train'),
                                                    class_mode='categorical',
                                                    batch_size=32,
                                                    target_size=(224,224),
                                                    shuffle=True)

validation_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'val'),
                                                        target_size=(224, 224),
                                                        batch_size=32,
                                                        class_mode='categorical',
                                                        shuffle=True) # set as validation data

# Input shape matches the generators' target_size plus 3 colour channels.
IMG_SHAPE = (224, 224, 3)

base_model = tf.keras.applications.ResNet50V2(
    input_shape=IMG_SHAPE,
    include_top=False,
    weights=None)

maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
prediction_layer = tf.keras.layers.Dense(4, activation='softmax')

model = tf.keras.Sequential([
    base_model,
    maxpool_layer,
    prediction_layer
])

opt = tf.keras.optimizers.Adam(learning_rate=0.004)
model.compile(optimizer=opt,
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

model.fit(
    train_generator,
    steps_per_epoch = train_generator.samples // 32,
    validation_data = validation_generator,
    validation_steps = validation_generator.samples // 32,
    epochs = 20)

