Sklearn metrics values are very different from Keras values


Problem Description


I need some help in order to understand how accuracy is calculated when fitting a model in Keras. This is a sample history of training the model:

Train on 340 samples, validate on 60 samples
Epoch 1/100
340/340 [==============================] - 5s 13ms/step - loss: 0.8081 - acc: 0.7559 - val_loss: 0.1393 - val_acc: 1.0000
Epoch 2/100
340/340 [==============================] - 3s 9ms/step - loss: 0.7815 - acc: 0.7647 - val_loss: 0.1367 - val_acc: 1.0000
Epoch 3/100
340/340 [==============================] - 3s 10ms/step - loss: 0.8042 - acc: 0.7706 - val_loss: 0.1370 - val_acc: 1.0000
...
Epoch 25/100
340/340 [==============================] - 3s 9ms/step - loss: 0.6006 - acc: 0.8029 - val_loss: 0.2418 - val_acc: 0.9333
Epoch 26/100
340/340 [==============================] - 3s 9ms/step - loss: 0.5799 - acc: 0.8235 - val_loss: 0.3004 - val_acc: 0.8833

So the validation accuracy is 1 in the first few epochs? How can the validation accuracy be better than the training accuracy?

These figures (not reproduced here) show all the accuracy and loss values.

Then I use sklearn metrics to evaluate final results:

from sklearn import metrics

def evaluate(predicted_outcome, expected_outcome):
    # Weighted F1: per-class F1 scores averaged, weighted by class support.
    f1_score = metrics.f1_score(expected_outcome, predicted_outcome, average='weighted')
    # Balanced accuracy: the average of per-class recalls.
    balanced_accuracy_score = metrics.balanced_accuracy_score(expected_outcome, predicted_outcome)
    print('****************************')
    print('| MODEL PERFORMANCE REPORT |')
    print('****************************')
    print('Average F1 score = {:0.2f}.'.format(f1_score))
    print('Balanced accuracy score = {:0.2f}.'.format(balanced_accuracy_score))
    print('Confusion matrix')
    print(metrics.confusion_matrix(expected_outcome, predicted_outcome))
    print('Other metrics')
    print(metrics.classification_report(expected_outcome, predicted_outcome))

I get this output (as you can see, the results are terrible):

****************************
| MODEL PERFORMANCE REPORT |
****************************
Average F1 score = 0.25.
Balanced accuracy score = 0.32.
Confusion matrix
[[  7  24   2  40]
 [ 11  70   4 269]
 [  0   0   0  48]
 [  0   0   0   6]]
Other metrics
              precision    recall  f1-score   support

           0       0.39      0.10      0.15        73
           1       0.74      0.20      0.31       354
           2       0.00      0.00      0.00        48
           3       0.02      1.00      0.03         6

   micro avg       0.17      0.17      0.17       481
   macro avg       0.29      0.32      0.12       481
weighted avg       0.61      0.17      0.25       481

Why are the accuracy and loss values from the Keras fit function so different from the values of the sklearn metrics?

This is my model, in case it helps:

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Flatten, Dense

model = Sequential()

model.add(LSTM(
                units=100,  # dimensionality of the hidden state
                return_sequences=True,
                input_shape=(timestamps, nb_features),
                dropout=0.2,
                recurrent_dropout=0.2
              )
         )

model.add(Dropout(0.2))

model.add(Flatten())

model.add(Dense(units=nb_classes,
                activation='softmax'))

model.compile(loss="categorical_crossentropy",
              metrics=['accuracy'],
              optimizer='adadelta')

Input data dimensions:

400 train sequences
481 test sequences
X_train shape: (400, 20, 17)
X_test shape: (481, 20, 17)
y_train shape: (400, 4)
y_test shape: (481, 4)

This is how I apply sklearn metrics:

import numpy as np

# predict() returns per-class probabilities; argmax converts both the
# predictions and the one-hot ground truth to integer class labels.
testPredict = model.predict(np.array(X_test))
y_test = np.argmax(y_test.values, axis=1)
y_pred = np.argmax(testPredict, axis=1)
evaluate(y_pred, y_test)

It looks like I am missing something.

Solution

You sound a little confused.

To start with, you are comparing apples to oranges, i.e. the validation accuracy reported by Keras on a 60-sample set (notice the first informative message printed by Keras, Train on 340 samples, validate on 60 samples) with the test accuracy reported by scikit-learn on your 481-sample test set.
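
To make the comparison like-for-like, evaluate the Keras model on the same 481-sample test set. A minimal sketch, assuming the model, X_test, and the original one-hot y_test from your question (before its argmax conversion) are still in scope:

import numpy as np
from sklearn import metrics

# Keras accuracy on the 481-sample test set (not the 60-sample
# validation set printed during fit).
test_loss, test_acc = model.evaluate(np.array(X_test), np.asarray(y_test), verbose=0)
print('Keras test accuracy: {:0.2f}'.format(test_acc))

# Plain accuracy from scikit-learn on the same set should closely match
# the Keras figure; balanced accuracy and weighted F1 are different
# metrics and, with classes this imbalanced (supports 73/354/48/6),
# can legitimately diverge from it.
y_pred = np.argmax(model.predict(np.array(X_test)), axis=1)
y_true = np.argmax(np.asarray(y_test), axis=1)
print('sklearn test accuracy: {:0.2f}'.format(metrics.accuracy_score(y_true, y_pred)))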

Second, your validation set of only 60 samples is way too small; with so few samples, wild fluctuations in the computed metrics, such as the ones you report, are certainly not unexpected (there is a reason why we need validation and test sets of sufficient size, not just training sets).
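
As a back-of-the-envelope illustration of why 60 samples is too few: the standard error of an accuracy estimate on n samples is sqrt(p*(1-p)/n), so on 60 samples even a true accuracy of 0.9 is measured to within only about 4 percentage points, and a single sample flipping moves the score by ~1.7 points:

import numpy as np

# Standard error of an accuracy estimate, and the score change caused
# by a single misclassified sample, for the two set sizes in question.
p = 0.9  # an assumed true accuracy, for illustration only
for n in (60, 481):
    print('n={:3d}: std. error = {:.3f}, one sample = {:.3f}'.format(
        n, np.sqrt(p * (1 - p) / n), 1 / n))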

Third, your training/validation/test set division is quite unusual, to say the least; standard practice asks for allocations of roughly 70/15/15 per cent or similar, while you are using an allocation of 38/7/55 per cent (i.e. 340/60/481 samples)...
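
For example, a more conventional split could be produced with scikit-learn's train_test_split; a sketch, assuming X and y are arrays pooling all 881 of your samples, with y one-hot encoded:

from sklearn.model_selection import train_test_split

# First carve off 15% as the test set, then 15% of what remains as the
# validation set, giving roughly a 70/15/15 allocation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.15, random_state=42)

# With classes as imbalanced as yours, passing stratify=y.argmax(axis=1)
# to keep class proportions consistent across splits is worth considering.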

Lastly, without knowing the details of your data, it may very well be the case that 340 samples are simply not enough to fit an LSTM model such as yours to a 4-class classification task.

For starters, allocate your data more appropriately into training/validation/test sets, and make sure you compare apples to apples...

PS In similar questions, you should also include your model.fit() part.
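
For reference, a fit call consistent with the "Train on 340 samples, validate on 60 samples" log above might look like the following (purely illustrative; the question does not show the actual call):

history = model.fit(X_train, y_train,
                    validation_split=0.15,  # 15% of the 400 training samples = 60 validation samples
                    epochs=100,
                    batch_size=32)  # the batch size is a guess; the log does not reveal it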
