Why do metrics calculated by model.evaluate() differ from the metrics tracked during training in Keras?


Question

I am using Keras 2.0.4 (TensorFlow backend) for an image classification task based on pretrained models. During training/fine-tuning I track all metrics in use (e.g. categorical_accuracy, categorical_crossentropy) with CSVLogger, including the corresponding metrics on the validation set (i.e. val_categorical_accuracy, val_categorical_crossentropy).

With the ModelCheckpoint callback I track the best configuration of weights (save_best_only=True). To evaluate the model on the validation set I use model.evaluate().

My expectation is that the metrics tracked by CSVLogger for the 'best' epoch equal the metrics calculated by model.evaluate(). Unfortunately this is not the case: the metrics differ by +-5%. Is there a reason for this behavior?
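For reference, a minimal sketch of how the 'best' epoch row could be pulled from the CSVLogger output for this comparison. It assumes pandas is available; filepath and file_csvlogger refer to the code listing further down:

import pandas as pd

# Read the per-epoch metrics written by CSVLogger (one row per epoch).
log = pd.read_csv(os.path.join(filepath, file_csvlogger))

# ModelCheckpoint monitors 'val_loss', so the 'best' epoch is the one with the lowest val_loss.
best_row = log.loc[log['val_loss'].idxmin()]
print(best_row)  # expected to match model.evaluate() on the saved checkpoint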

E D I T
After some testing I could gain some insights:

  1. If I don't use a generator for the training and validation data (and therefore no model.fit_generator()), the problem does not occur. --> Using the ImageDataGenerator for training and validation data is the source of the discrepancy. (Note that for the evaluate() calculation I don't use a generator, but I do use the same validation data - at least if the ImageDataGenerator worked as expected...)
     I think the ImageDataGenerator doesn't work as it should (please also have a look at this); a sketch for checking what the generator actually delivers follows after this list.
  2. If I use no generators at all, this problem does not exist, i.e. the metrics tracked by CSVLogger (of the 'best' epoch) equal the metrics calculated by model.evaluate().
     Interestingly, there is another problem: if the same data is used for training and validation, there is a discrepancy between training metrics (e.g. loss) and validation metrics (e.g. val_loss) at the end of each epoch.
     (A similar problem)
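A minimal sketch for checking whether the validation generator delivers the same preprocessed arrays as the manually loaded X_val/Y_val. Variable names are taken from the code listing below; comparing pixel data only makes sense if X_val was loaded in the same alphabetical order that flow_from_directory uses:

check_generator = val_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=40,        # all 40 validation images in a single batch
    shuffle=False,        # keep the directory (alphabetical) order
    class_mode='categorical')

gen_x, gen_y = next(check_generator)  # one batch = the whole validation set
print('pixel data identical:', np.allclose(gen_x, X_val))
print('labels identical:    ', np.allclose(gen_y, Y_val))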

Code used:

############################ import section ############################
from __future__ import print_function # perform like in python 3.x
from keras.datasets import mnist
from keras.utils import np_utils # numpy utils for to_categorical()
from keras.models import Model, load_model
from keras.layers import Dense, GlobalAveragePooling2D, Dropout, GaussianDropout, Conv2D, MaxPooling2D
from keras.optimizers import SGD, Adam
from keras import backend as K
from keras.preprocessing.image import ImageDataGenerator 
from keras import metrics
import os
import sys
from scipy import misc
import numpy as np
from keras.applications.vgg16 import preprocess_input as vgg16_preprocess_input
from keras.applications import VGG16
from keras.callbacks import CSVLogger, ModelCheckpoint


############################ manual settings ###########################
# general settings
seed = 1337

loss_function = 'categorical_crossentropy'

learning_rate = 0.001

epochs = 10

batch_size = 20

nb_classes = 5 

img_width, img_height = 400, 400 # >= 48 necessary, as VGG16 is used

chosen_optimizer = SGD(lr=learning_rate, momentum=0.0, decay=0.0, nesterov=False)

steps_per_epoch = 40 // batch_size   # 40 training samples in 5 classes
validation_steps = 40 // batch_size  # 40 validation samples in 5 classes

data_dir = # TODO: set path where data is stored (folders: 'train', 'val', 'test'; within each folder are folders named by classes)

# callbacks: CSVLogger & ModelCheckpoint
filepath = # TODO: set path, where you want to store files generated by the callbacks
file_best_checkpoint= 'best_epoch.hdf5'
file_csvlogger = 'logged_metrics.txt'

modelcheckpoint_best_epoch= ModelCheckpoint(filepath=os.path.join(filepath, file_best_checkpoint), 
                                  monitor = 'val_loss' , verbose = 1, 
                                  save_best_only = True, 
                                  save_weights_only=False, mode='auto', 
                                  period=1) # every epoch executed
csvlogger = CSVLogger(os.path.join(filepath, file_csvlogger) , separator=',', append=False)



############################ prepare data ##############################
# get validation data (for evaluation)
X_val, Y_val = # TODO: load validation data (4d array: samples, img_width, img_height, nb_channels). IMPORTANT: 5 classes with 8 images each.

# preprocess data
my_preprocessing_function = mf.my_vgg16_preprocess_input # custom preprocessing from a helper module 'mf' (not shown here); presumably wraps the vgg16_preprocess_input imported above

# 'augmentation' configuration we will use for training
train_datagen = ImageDataGenerator(preprocessing_function = my_preprocessing_function) # only preprocessing; static data set

# 'augmentation' configuration we will use for validation
val_datagen = ImageDataGenerator(preprocessing_function = my_preprocessing_function) # only preprocessing; static data set

train_data_dir = os.path.join(data_dir, 'train')
validation_data_dir = os.path.join(data_dir, 'val')
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle = True,
    seed = seed, # random seed for shuffling and transformations
    class_mode='categorical')  # label type (categorical = one-hot vector)

validation_generator = val_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle = True,
    seed = seed, # random seed for shuffling and transformations
    class_mode='categorical')  # label type (categorical = one-hot vector)



############################## training ###############################
print("
---------------------------------------------------------------")
print("------------------------ training model -----------------------")
print("---------------------------------------------------------------")
# create the base pre-trained model
base_model = VGG16(include_top=False, weights = None, input_shape=(img_width, img_height, 3), pooling = 'max', classes = nb_classes)
model_name =  "VGG_modified"

# do not freeze any layers --> all layers trainable
for layer in base_model.layers:
    layer.trainable = True

# define topping of base_model
x = base_model.output # get the last layer of our base_model
x = Dense(1024, activation='relu', name='fc1')(x)
x = Dense(1024, activation='relu', name='fc2')(x)
predictions = Dense(nb_classes, activation='softmax', name='predictions')(x)

# finally, stack model together
model = Model(outputs=predictions, name= model_name, inputs=base_model.input) #Keras 1.x.x: model = Model(input=base_model.input, output=predictions) 
print(model.summary())

# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer = chosen_optimizer, loss=loss_function, 
            metrics=['categorical_accuracy','kullback_leibler_divergence'])

# train the model on your data
model.fit_generator(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=validation_steps,
    callbacks = [csvlogger, modelcheckpoint_best_epoch])



############################## evaluation ##############################
print("

---------------------------------------------------------------")
print("------------------ Evaluation of Best Epoch -------------------")
print("---------------------------------------------------------------")
# load model (corresponding to best training epoch)
model = load_model(os.path.join(filepath, file_best_checkpoint))

# evaluate model on validation data (in test mode!)
list_of_metrics = model.evaluate(X_val, Y_val, batch_size=batch_size, verbose=1, sample_weight=None)
index = 0
print('\nMetrics:')
for metric in model.metrics_names:
    print(metric+ ':' , str(list_of_metrics[index]))
    index += 1


E D I T 2
Referring to 1. of E D I T: if I use the same generator for the validation data during training and evaluation (by using evaluate_generator()), the problem still occurs. Hence, it is definitely a problem caused by the generators...
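For completeness, a minimal sketch of this evaluation path, reusing the variables from the code listing above. A fresh, non-shuffled generator is created so the 40 validation images are seen exactly once, following the Keras 2.0.x signature evaluate_generator(generator, steps, ...):

eval_generator = val_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle=False,                 # evaluation does not need shuffling
    class_mode='categorical')

list_of_metrics = model.evaluate_generator(eval_generator, steps=40 // batch_size)
for name, value in zip(model.metrics_names, list_of_metrics):
    print(name + ':', value)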

Answer

This will only be the case for the evaluation of the metrics on the validation dataset.

The metrics computed on the training dataset during training do not reflect the real metrics of the model at the end of the epoch, because the model is updated (modified) after every single batch: the logged training values are a running average over batches computed with constantly changing weights.
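If end-of-epoch training metrics that are comparable to model.evaluate() are needed, one option is a small callback that re-evaluates the (now fixed) weights at the end of every epoch. This is only a sketch; it assumes the training data is also available as in-memory arrays X_train/Y_train, which are not part of the code above:

from keras.callbacks import Callback

class EpochEndEvaluation(Callback):
    """Evaluates the current (fixed) weights on the given arrays at the end of each epoch."""
    def __init__(self, x, y, batch_size=20):
        super(EpochEndEvaluation, self).__init__()
        self.x = x
        self.y = y
        self.batch_size = batch_size

    def on_epoch_end(self, epoch, logs=None):
        scores = self.model.evaluate(self.x, self.y, batch_size=self.batch_size, verbose=0)
        for name, value in zip(self.model.metrics_names, scores):
            print('epoch %d - end-of-epoch %s: %.4f' % (epoch + 1, name, value))

# hypothetical usage: add it to the callbacks passed to fit()/fit_generator(), e.g.
# callbacks=[csvlogger, modelcheckpoint_best_epoch, EpochEndEvaluation(X_train, Y_train, batch_size)]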

Does that help?

