Why do the metrics calculated by model.evaluate() differ from the metrics tracked during training in Keras?


Problem description

I am using Keras 2.0.4 (TensorFlow backend) for an image classification task (based on pretrained models). During training/tuning I track all used metrics (e.g. categorical_accuracy, categorical_crossentropy) with CSVLogger, including the corresponding metrics on the validation set (i.e. val_categorical_accuracy, val_categorical_crossentropy).

With the callback ModelCheckpoint I am tracking the best configuration of weights (save_best_only=True). In order to evaluate the model on the validation set I use model.evaluate().

My expectation is that the metrics tracked by CSVLogger (for the 'best' epoch) equal the metrics calculated by model.evaluate(). Unfortunately this is NOT the case: the metrics differ by +-5%. Is there a reason for this behavior?
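For concreteness, here is a minimal sketch of how I compare the two numbers (it assumes the CSVLogger output file and the metric names from the code listed further below):

import csv

# read the per-epoch metrics written by CSVLogger
with open('logged_metrics.txt') as f:
    rows = list(csv.DictReader(f))

# the 'best' epoch is the one ModelCheckpoint keeps: minimal val_loss
best_row = min(rows, key=lambda row: float(row['val_loss']))
print('CSVLogger, best epoch:', best_row['val_loss'], best_row['val_categorical_accuracy'])

# ...versus the metrics returned by model.evaluate() on the same validation data:
# list_of_metrics = model.evaluate(X_val, Y_val, batch_size=batch_size)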

E D I T:

After some testing I could gain some insights:

  1. If I don't use a generator for the training and validation data (and therefore no model.fit_generator()), the problem doesn't occur. --> Using the ImageDataGenerator for training and validation data is the source of the discrepancy. (Please note: for the evaluate() calculation I don't use a generator, but I do use the same validation data, at least if the ImageDataGenerator worked as expected...)
    I think the ImageDataGenerator doesn't work as it should (please also have a look at this). A workaround sketch follows after this list.
  2. If I use no generators at all, the problem doesn't occur, i.e. the metrics tracked by CSVLogger (of the 'best' epoch) equal the metrics calculated by model.evaluate().
    Interestingly, there is another problem: if you use the same data for training and validation, there will be a discrepancy between the training metrics (e.g. loss) and the validation metrics (e.g. val_loss) at the end of each epoch.
    (A similar problem)
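As a workaround for point 1, here is a sketch of what I would try (variable names are the ones from the code listed below): draw the validation set from a fresh, non-shuffled generator into plain arrays, so that model.evaluate() sees exactly the images and the preprocessing the generator produces:

import numpy as np

eval_generator = val_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle=False,  # deterministic order, fresh iterator state
    class_mode='categorical')

# materialize exactly validation_steps batches (here: the whole validation set)
batches = [next(eval_generator) for _ in range(validation_steps)]
X_val = np.concatenate([x for x, y in batches])
Y_val = np.concatenate([y for x, y in batches])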

Code used:

############################ import section ############################
from __future__ import print_function # perform like in python 3.x
from keras.datasets import mnist
from keras.utils import np_utils # numpy utils for to_categorical()
from keras.models import Model, load_model
from keras.layers import Dense, GlobalAveragePooling2D, Dropout, GaussianDropout, Conv2D, MaxPooling2D
from keras.optimizers import SGD, Adam
from keras import backend as K
from keras.preprocessing.image import ImageDataGenerator 
from keras import metrics
import os
import sys
from scipy import misc
import numpy as np
from keras.applications.vgg16 import preprocess_input as vgg16_preprocess_input
from keras.applications import VGG16
from keras.callbacks import CSVLogger, ModelCheckpoint


############################ manual settings ###########################
# general settings
seed = 1337

loss_function = 'categorical_crossentropy'

learning_rate = 0.001

epochs = 10

batch_size = 20

nb_classes = 5 

img_width, img_height = 400, 400 # >= 48 necessary, as VGG16 is used

chosen_optimizer = SGD(lr=learning_rate, momentum=0.0, decay=0.0, nesterov=False)

steps_per_epoch = 40 // batch_size   # 40 train samples in 5 classes
validation_steps = 40 // batch_size  # 40 validation samples in 5 classes

data_dir = # TODO: set path where data is stored (folders: 'train', 'val', 'test'; within each folder are folders named by classes)

# callbacks: CSVLogger & ModelCheckpoint
filepath = # TODO: set path, where you want to store files generated by the callbacks
file_best_checkpoint= 'best_epoch.hdf5'
file_csvlogger = 'logged_metrics.txt'

modelcheckpoint_best_epoch= ModelCheckpoint(filepath=os.path.join(filepath, file_best_checkpoint), 
                                  monitor = 'val_loss' , verbose = 1, 
                                  save_best_only = True, 
                                  save_weights_only=False, mode='auto', 
                                  period=1) # every epoch executed
csvlogger = CSVLogger(os.path.join(filepath, file_csvlogger) , separator=',', append=False)



############################ prepare data ##############################
# get validation data (for evaluation)
X_val, Y_val = # TODO: load validation data (4d array: samples, img_width, img_height, nb_channels) IMPORTANT: 5 classes with 8 images each.

# preprocess data
my_preprocessing_function = vgg16_preprocess_input # VGG16 preprocessing (imported above)

# 'augmentation' configuration we will use for training
train_datagen = ImageDataGenerator(preprocessing_function = my_preprocessing_function) # only preprocessing; static data set

# 'augmentation' configuration we will use for validation
val_datagen = ImageDataGenerator(preprocessing_function = my_preprocessing_function) # only preprocessing; static data set

train_data_dir = os.path.join(data_dir, 'train')
validation_data_dir = os.path.join(data_dir, 'val')
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle = True,
    seed = seed, # random seed for shuffling and transformations
    class_mode='categorical')  # label type (categorical = one-hot vector)

validation_generator = val_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle = True,
    seed = seed, # random seed for shuffling and transformations
    class_mode='categorical')  # label type (categorical = one-hot vector)



############################## training ###############################
print("\n---------------------------------------------------------------")
print("------------------------ training model -----------------------")
print("---------------------------------------------------------------")
# create the base pre-trained model
base_model = VGG16(include_top=False, weights = None, input_shape=(img_width, img_height, 3), pooling = 'max', classes = nb_classes)
model_name =  "VGG_modified"

# do not freeze any layers --> all layers trainable
for layer in base_model.layers:
    layer.trainable = True

# define topping of base_model
x = base_model.output # get the last layer of our base_model
x = Dense(1024, activation='relu', name='fc1')(x)
x = Dense(1024, activation='relu', name='fc2')(x)
predictions = Dense(nb_classes, activation='softmax', name='predictions')(x)

# finally, stack model together
model = Model(outputs=predictions, name= model_name, inputs=base_model.input) #Keras 1.x.x: model = Model(input=base_model.input, output=predictions) 
print(model.summary())

# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer = chosen_optimizer, loss=loss_function, 
            metrics=['categorical_accuracy','kullback_leibler_divergence'])

# train the model on your data
model.fit_generator(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=validation_steps,
    callbacks = [csvlogger, modelcheckpoint_best_epoch])



############################## evaluation ##############################
print("\n\n---------------------------------------------------------------")
print("------------------ Evaluation of Best Epoch -------------------")
print("---------------------------------------------------------------")
# load model (corresponding to best training epoch)
model = load_model(os.path.join(filepath, file_best_checkpoint))

# evaluate model on validation data (in test mode!)
list_of_metrics = model.evaluate(X_val, Y_val, batch_size=batch_size, verbose=1, sample_weight=None)
print('\nMetrics:')
for metric_name, metric_value in zip(model.metrics_names, list_of_metrics):
    print(metric_name + ':', str(metric_value))


E D I T 2
Referring to point 1 of E D I T: If I use the same generator for the validation data during training and evaluation (by using evaluate_generator()), the problem still occurs. Hence, it is definitely a problem caused by the generators...
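For reference, the test from this edit looks roughly like this (note that validation_generator was created with shuffle=True, so its internal shuffling state carries over from training):

# reuse the training-time validation generator for evaluation
list_of_metrics = model.evaluate_generator(validation_generator, steps=validation_steps)
for metric_name, metric_value in zip(model.metrics_names, list_of_metrics):
    print(metric_name + ':', str(metric_value))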

Recommended answer

This will only be the case for the evaluation of the metrics on the validation dataset.

The metrics computed on the training dataset during training do not reflect the real metrics of the model at the end of the epoch, because the model is updated (modified) at each single batch.
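To illustrate this point, here is a small hypothetical callback (not part of the question's code): the loss Keras logs for an epoch is averaged over batches that were computed with changing weights, whereas an evaluation at the end of the epoch uses the final weights only, so the two numbers legitimately differ.

from keras.callbacks import Callback

class EpochEndTrainEval(Callback):
    # hypothetical helper: re-evaluate the (now fixed) model on the training
    # data at the end of each epoch and compare with the running average
    def __init__(self, x, y):
        super(EpochEndTrainEval, self).__init__()
        self.x, self.y = x, y

    def on_epoch_end(self, epoch, logs=None):
        end_of_epoch_loss = self.model.evaluate(self.x, self.y, verbose=0)[0]
        print('epoch %d: running-average loss %.4f vs. end-of-epoch loss %.4f'
              % (epoch, logs['loss'], end_of_epoch_loss))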

Does that help?
