Validation accuracy metrics reported by Keras model.fit log and Sklearn.metrics.confusion_matrix don't match each other


Problem description

The problem is that the reported validation accuracy value I get from Keras model.fit history is significantly higher than the validation accuracy metric I get from sklearn.metrics functions.

The results I get from model.fit are summarized below:

Last Validation Accuracy: 0.81
Best Validation Accuracy: 0.84

The results (normalized) from sklearn are pretty different:

True Negatives: 0.78
True Positives: 0.77

Validation Accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.775 

(see confusion matrix below for reference)

Edit: this calculation is incorrect, because one cannot use the normalized values to calculate the accuracy, since they do not account for differences in the total absolute number of points in the dataset. Thanks to the comment by desertnaut.

Here is the graph of the validation accuracy data from model.fit history:

[validation accuracy per epoch plot - image not reproduced here]

And here is the confusion matrix generated from sklearn:

[normalized confusion matrix plot - image not reproduced here]

I think this question is somewhat similar to this one: Sklearn metrics values are very different from Keras values. But I've checked that both methods are doing the validation on the same pool of data, so that answer is probably not adequate for my case.

Also, this question Keras binary accuracy metric gives too high accuracy seems to address some problems with the way binary cross-entropy affects a multiclass problem, but it may not apply in my case, since it is a true binary classification problem.

Here are the commands used:

Model definition:

from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model

inputs = Input((Tx, ))
n_e = 30  # embedding dimension
embeddings = Embedding(n_x, n_e, input_length=Tx)(inputs)
out = Bidirectional(LSTM(32, recurrent_dropout=0.5, return_sequences=True))(embeddings)
out = Bidirectional(LSTM(16, recurrent_dropout=0.5, return_sequences=True))(out)
out = Bidirectional(LSTM(16, recurrent_dropout=0.5))(out)
out = Dense(3, activation='softmax')(out)
mymodel = Model(inputs=inputs, outputs=out)
mymodel.summary()
      

Model summary:

      _________________________________________________________________
      Layer (type)                 Output Shape              Param #   
      =================================================================
      input_1 (InputLayer)         (None, 100)               0         
      _________________________________________________________________
      embedding (Embedding)        (None, 100, 30)           86610     
      _________________________________________________________________
      bidirectional (Bidirectional (None, 100, 64)           16128     
      _________________________________________________________________
      bidirectional_1 (Bidirection (None, 100, 32)           10368     
      _________________________________________________________________
      bidirectional_2 (Bidirection (None, 32)                6272      
      _________________________________________________________________
      dense (Dense)                (None, 3)                 99        
      =================================================================
      Total params: 119,477
      Trainable params: 119,477
      Non-trainable params: 0
      _________________________________________________________________
      

Model compilation:

      mymodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
      

Model fit call:

      num_epochs = 30
      myhistory = mymodel.fit(X_pad, y, epochs=num_epochs, batch_size=50, validation_data=[X_val_pad, y_val_oh], shuffle=True, callbacks=callbacks_list)
      

Model fit log:

      Train on 505 samples, validate on 127 samples
      
      Epoch 1/30
      500/505 [============================>.] - ETA: 0s - loss: 0.6135 - acc: 0.6667
      [...]
      Epoch 10/30
      500/505 [============================>.] - ETA: 0s - loss: 0.1403 - acc: 0.9633
      Epoch 00010: val_acc improved from 0.77953 to 0.79528, saving model to modelo-10-melhor-modelo.hdf5
      505/505 [==============================] - 21s 41ms/sample - loss: 0.1393 - acc: 0.9637 - val_loss: 0.5203 - val_acc: 0.7953
      Epoch 11/30
      500/505 [============================>.] - ETA: 0s - loss: 0.0865 - acc: 0.9840
      Epoch 00011: val_acc did not improve from 0.79528
      505/505 [==============================] - 21s 41ms/sample - loss: 0.0860 - acc: 0.9842 - val_loss: 0.5257 - val_acc: 0.7953
      Epoch 12/30
      500/505 [============================>.] - ETA: 0s - loss: 0.0618 - acc: 0.9900
      Epoch 00012: val_acc improved from 0.79528 to 0.81102, saving model to modelo-10-melhor-modelo.hdf5
      505/505 [==============================] - 21s 42ms/sample - loss: 0.0615 - acc: 0.9901 - val_loss: 0.5472 - val_acc: 0.8110
      Epoch 13/30
      500/505 [============================>.] - ETA: 0s - loss: 0.0415 - acc: 0.9940
      Epoch 00013: val_acc improved from 0.81102 to 0.82152, saving model to modelo-10-melhor-modelo.hdf5
      505/505 [==============================] - 21s 42ms/sample - loss: 0.0413 - acc: 0.9941 - val_loss: 0.5853 - val_acc: 0.8215
      Epoch 14/30
      500/505 [============================>.] - ETA: 0s - loss: 0.0443 - acc: 0.9933
      Epoch 00014: val_acc did not improve from 0.82152
      505/505 [==============================] - 21s 42ms/sample - loss: 0.0453 - acc: 0.9921 - val_loss: 0.6043 - val_acc: 0.8136
      Epoch 15/30
      500/505 [============================>.] - ETA: 0s - loss: 0.0360 - acc: 0.9933
      Epoch 00015: val_acc improved from 0.82152 to 0.84777, saving model to modelo-10-melhor-modelo.hdf5
      505/505 [==============================] - 21s 42ms/sample - loss: 0.0359 - acc: 0.9934 - val_loss: 0.5663 - val_acc: 0.8478
      [...]
      Epoch 30/30
      500/505 [============================>.] - ETA: 0s - loss: 0.0039 - acc: 1.0000
      Epoch 00030: val_acc did not improve from 0.84777
      505/505 [==============================] - 20s 41ms/sample - loss: 0.0039 - acc: 1.0000 - val_loss: 0.8340 - val_acc: 0.8110
      

      Confusion matrix from sklearn:

from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(values_gold, values_pred)  # gold and predicted text labels, defined below
      

      The prediction values and gold values are determined as follows:

preds = mymodel.predict(X_val_pad)                        # class probabilities per sample
preds_ints = [[el] for el in np.argmax(preds, axis=1)]    # most likely class index per sample
values_pred = tokenizer_y.sequences_to_texts(preds_ints)  # map indices back to text labels
values_gold = tokenizer_y.sequences_to_texts(y_val)
      

Finally, I'd like to add that I have printed out the data and all prediction errors, and I believe the sklearn values are more reliable, since they seem to match the results I get from printing out the predictions for the saved "best" model.
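
For reference, a minimal sketch of that check, assuming the checkpoint name from the log above and the same validation arrays passed to model.fit:

import numpy as np
from tensorflow.keras.models import load_model
from sklearn.metrics import accuracy_score

# Reload the saved "best" checkpoint and score it on the validation data
best_model = load_model('modelo-10-melhor-modelo.hdf5')
preds = best_model.predict(X_val_pad)
pred_labels = np.argmax(preds, axis=1)     # predicted class indices
true_labels = np.argmax(y_val_oh, axis=1)  # one-hot gold labels back to indices
print(accuracy_score(true_labels, pred_labels))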

On the other hand, I can't understand how the metrics can be so different. Since they are both very well-known software packages, I conclude I'm the one making the mistake here, but I can't pin down where or how.

Answer

Your question is ill-posed; as already commented, you have not computed the actual accuracy of your scikit-learn model, hence you seem to compare apples with oranges. The computation (TP + TN)/2 from a normalized confusion matrix does not give the accuracy. Here is a simple demonstration using toy data, adapting the plot_confusion_matrix function from the docs:

      import numpy as np
      import matplotlib.pyplot as plt
      from sklearn.metrics import confusion_matrix
      
      # toy data
      y_true = [0, 1, 0, 1, 0, 0, 0, 1]
      y_pred =  [1, 1, 1, 0, 1, 1, 0, 1]
      class_names=[0,1]
      
      # plot_confusion_matrix function
      
      def plot_confusion_matrix(y_true, y_pred, classes,
                                normalize=False,
                                title=None,
                                cmap=plt.cm.Blues):
          """
          This function prints and plots the confusion matrix.
          Normalization can be applied by setting `normalize=True`.
          """
          if not title:
              if normalize:
                  title = 'Normalized confusion matrix'
              else:
                  title = 'Confusion matrix, without normalization'
      
          # Compute confusion matrix
          cm = confusion_matrix(y_true, y_pred)
      
          if normalize:
              cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
              print("Normalized confusion matrix")
          else:
              print('Confusion matrix, without normalization')
      
          print(cm)
      
          fig, ax = plt.subplots()
          im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
          ax.figure.colorbar(im, ax=ax)
          # We want to show all ticks...
          ax.set(xticks=np.arange(cm.shape[1]),
                 yticks=np.arange(cm.shape[0]),
                 # ... and label them with the respective list entries
                 xticklabels=classes, yticklabels=classes,
                 title=title,
                 ylabel='True label',
                 xlabel='Predicted label')
      
          # Rotate the tick labels and set their alignment.
          plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
                   rotation_mode="anchor")
      
          # Loop over data dimensions and create text annotations.
          fmt = '.2f' if normalize else 'd'
          thresh = cm.max() / 2.
          for i in range(cm.shape[0]):
              for j in range(cm.shape[1]):
                  ax.text(j, i, format(cm[i, j], fmt),
                          ha="center", va="center",
                          color="white" if cm[i, j] > thresh else "black")
          fig.tight_layout()
          return ax
      

Computing the normalized confusion matrix gives:

      plot_confusion_matrix(y_true, y_pred, classes=class_names, normalize=True)
      # result:
      Normalized confusion matrix
      [[ 0.2         0.8       ]
       [ 0.33333333  0.66666667]]
      

      and according to your incorrect rationale, the accuracy should be:

      (0.67 + 0.2)/2
      # 0.435
      

      (Notice how in the normalized matrix the rows add to 100%, something that does not happen in the full confusion matrix)
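
A quick way to verify that row normalization on the same toy data:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print(cm_norm.sum(axis=1))  # [1. 1.] - each row sums to 100%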

      But let's now see what the real accuracy is from the un-normalized confusion matrix:

      plot_confusion_matrix(y_true, y_pred, classes=class_names) # normalize=False by default
      # result
      Confusion matrix, without normalization
      [[1 4]
       [1 2]]
      

      from which, by the definition of accuracy as (TP + TN) / (TP + TN + FP + FN), we get:

      (1+2)/(1+2+4+1)
      # 0.375
      

      Of course, we don't need the confusion matrix to get something so elementary as the accuracy; as already advised in the comments, we can simply use the built-in accuracy_score method of scikit-learn:

      from sklearn.metrics import accuracy_score
      accuracy_score(y_true, y_pred)
      # 0.375
      

      which, rather unsurprisingly, agrees with our direct computation from the confusion matrix.
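
Equivalently, the accuracy is the trace of the un-normalized matrix divided by its total; a minimal sketch on the same toy data:

import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)  # same toy y_true / y_pred as above
print(np.trace(cm) / cm.sum())         # (1 + 2) / 8 = 0.375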

Bottom line:

• where specific methods (like accuracy_score) exist, it is definitely preferable to use them instead of ad hoc inspirations, especially when something does not look right (like a discrepancy between Keras and scikit-learn reported accuracies)
• the fact that in this example the actual accuracy turned out lower than the one you computed yourself obviously does not say anything about the specific problem you report
• if the discrepancy with Keras still exists even after computing the correct accuracy for your data, please do not alter the question with the new situation, as this would invalidate the answer despite the fact that it highlights a mistaken point in your method; open a new question instead
