K-fold cross-validation using Keras

Question

It seems that k-fold cross-validation is rarely used with convolutional networks because neural networks take so long to train. I have a small dataset and would like to do k-fold cross-validation using the example given here. Is that possible? Thanks.

Recommended answer

If you are using images with data generators, here's one way to do 10-fold cross-validation with Keras and scikit-learn. The strategy is to copy the files to training, validation, and test subfolders according to each fold.

import numpy as np
import os
import pandas as pd
import shutil
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# used to copy files according to each fold
def copy_images(df, directory):
    destination_directory = "{path to your data directory}/" + directory
    print("copying {} files to {}...".format(directory, destination_directory))

    # remove all files from previous fold
    if os.path.exists(destination_directory):
        shutil.rmtree(destination_directory)

    # create folder for files from this fold
    if not os.path.exists(destination_directory):
        os.makedirs(destination_directory)

    # create subfolders for each class
    for c in set(list(df['class'])):
        if not os.path.exists(destination_directory + '/' + c):
            os.makedirs(destination_directory + '/' + c)

    # copy files for this fold from a directory holding all the files
    for i, row in df.iterrows():
        try:
            # this is the path to all of your images kept together in a separate folder
            path_from = "{path to all of your images}"
            path_from = path_from + "{}.jpg"
            path_to = "{}/{}".format(destination_directory, row['class'])

            # copy from the folder holding all files to the training, test, or validation folder (the "directory" argument)
            shutil.copy(path_from.format(row['filename']), path_to)
        except Exception as e:
            print("Error when copying {}: {}".format(row['filename'], str(e)))

# dataframe containing the filenames of the images (e.g., GUID filenames) and the classes
df = pd.read_csv('{path to your data}.csv')
df_y = df['class']
# drop() returns a new frame, so the original df keeps its 'class' column
df_x = df.drop(columns=['class'])

skf = StratifiedKFold(n_splits = 10)
total_actual = []
total_predicted = []
total_val_accuracy = []
total_val_loss = []
total_test_accuracy = []

for i, (train_index, test_index) in enumerate(skf.split(df_x, df_y)):
    x_train, x_test = df_x.iloc[train_index], df_x.iloc[test_index]
    y_train, y_test = df_y.iloc[train_index], df_y.iloc[test_index]

    train = pd.concat([x_train, y_train], axis=1)
    test = pd.concat([x_test, y_test], axis = 1)

    # take 20% of the training data from this fold for validation during training
    validation = train.sample(frac = 0.2)

    # make sure validation data does not include training data
    train = train[~train['filename'].isin(list(validation['filename']))]

    # copy the images according to the fold
    copy_images(train, 'training')
    copy_images(validation, 'validation')
    copy_images(test, 'test')

    print('**** Running fold '+ str(i))

    # here you call a function to create and train your model, returning validation accuracy and validation loss
    val_accuracy, val_loss = create_train_model()

    # append validation accuracy and loss for average calculation later on
    total_val_accuracy.append(val_accuracy)
    total_val_loss.append(val_loss)

    # here you will call a predict() method that will predict the images on the "test" subfolder 
    # this function returns the actual classes and the predicted classes in the same order
    actual, predicted = predict()

    # append accuracy from the predictions on the test data
    total_test_accuracy.append(accuracy_score(actual, predicted))

    # append all of the actual and predicted classes for your final evaluation
    total_actual = total_actual + actual
    total_predicted = total_predicted + predicted

    # this is optional, but you can also print the cumulative performance over all folds processed so far
    print(classification_report(total_actual, total_predicted))
    print(confusion_matrix(total_actual, total_predicted))

print(classification_report(total_actual, total_predicted))
print(confusion_matrix(total_actual, total_predicted))
print("Validation accuracy on each fold:")
print(total_val_accuracy)
print("Mean validation accuracy: {}%".format(np.mean(total_val_accuracy) * 100))

print("Validation loss on each fold:")
print(total_val_loss)
print("Mean validation loss: {}".format(np.mean(total_val_loss)))

print("Test accuracy on each fold:")
print(total_test_accuracy)
print("Mean test accuracy: {}%".format(np.mean(total_test_accuracy) * 100))
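To check the per-fold partition in isolation, here is a minimal sketch on a toy DataFrame. The filenames, class names, number of folds, and the split_fold helper are made up for illustration; only the split logic mirrors the code above. Fixing random_state is an added assumption so that the validation sample is reproducible.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 20 filenames, two balanced classes.
df = pd.DataFrame({
    "filename": ["img_{:02d}".format(i) for i in range(20)],
    "class": ["cat"] * 10 + ["dog"] * 10,
})

def split_fold(df, train_index, test_index, val_frac=0.2, seed=0):
    """Return disjoint (train, validation, test) frames for one fold."""
    test = df.iloc[test_index]
    train_full = df.iloc[train_index]
    # Hold out part of the training rows for validation during training.
    validation = train_full.sample(frac=val_frac, random_state=seed)
    # Remove the validation rows so training and validation do not overlap.
    train = train_full.drop(validation.index)
    return train, validation, test

skf = StratifiedKFold(n_splits=5)
folds = [split_fold(df, tr, te) for tr, te in skf.split(df["filename"], df["class"])]

for i, (train, validation, test) in enumerate(folds):
    print("fold {}: train={}, validation={}, test={}".format(
        i, len(train), len(validation), len(test)))
```

Because the split is stratified, each test fold keeps the class proportions of the full dataset, which matters most when the dataset is small.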

In your predict() function, if you are using a data generator, the only way I could find to keep the predictions in the same order when testing was to use a batch_size of 1:

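# assumes standalone Keras 2.x; with TensorFlow, import from tensorflow.keras.preprocessing.image
from keras.preprocessing.image import ImageDataGenerator

generator = ImageDataGenerator().flow_from_directory(
        '{path to your data directory}/test',
        # Keras expects target_size as (height, width)
        target_size = (img_height, img_width),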
        batch_size = 1,
        color_mode = 'rgb',
        # categorical for a multiclass problem
        class_mode = 'categorical',
        # this will also ensure the same order
        shuffle = False)
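As a rough sketch of how a predict() function might turn the model output into the two lists the cross-validation loop expects: the helper name to_label_lists and the sample arrays below are hypothetical. In a real run the probabilities would come from the model's predictions on the generator, and the true class ids and the name-to-id mapping would come from the generator's classes and class_indices attributes.

```python
import numpy as np

def to_label_lists(probabilities, true_indices, class_indices):
    """Map probability rows and integer class ids to lists of class names.

    probabilities: array of shape (n_samples, n_classes), in generator order
    true_indices:  integer class ids, e.g. generator.classes
    class_indices: dict of class name -> id, e.g. generator.class_indices
    """
    index_to_name = {index: name for name, index in class_indices.items()}
    # The predicted class is the column with the highest probability.
    predicted_ids = np.argmax(probabilities, axis=1)
    actual = [index_to_name[int(i)] for i in true_indices]
    predicted = [index_to_name[int(i)] for i in predicted_ids]
    return actual, predicted

# Made-up values standing in for real model output.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
actual, predicted = to_label_lists(probs, [0, 1, 1], {"cat": 0, "dog": 1})
print(actual)
print(predicted)
```

Returning plain Python lists keeps the `total_actual + actual` concatenation in the loop working as written.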

With this code, I was able to do 10-fold cross-validation using data generators, so I did not have to keep all the files in memory. Copying the files for every fold can be a lot of work if you have millions of images, and batch_size = 1 could become a bottleneck if your test set is large, but for my project this worked well.
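If batch_size = 1 turns out to be too slow, one option worth testing is a larger batch with shuffle = False and a step count that covers every test image exactly once. This is a sketch under that assumption, not something the answer above verified; prediction_steps is a hypothetical helper, and it is worth comparing the number of predictions with the number of test files before relying on the order.

```python
import math

def prediction_steps(n_samples, batch_size):
    """Number of batches needed to see every sample once (the last batch may be smaller)."""
    if n_samples < 0 or batch_size <= 0:
        raise ValueError("n_samples must be >= 0 and batch_size must be > 0")
    return math.ceil(n_samples / batch_size)

print(prediction_steps(1000, 32))
```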
