How to compute accuracy and the confusion matrix using K-fold cross-validation?

Problem description

I tried to do K-fold cross-validation with K=30 folds, producing one confusion matrix per fold. How can I compute the model's accuracy and confusion matrix, together with a confidence interval? Could someone help me?

My code is:

import numpy as np
from sklearn import model_selection
from sklearn import datasets
from sklearn import svm
import pandas as pd
from sklearn.linear_model import LogisticRegression

UNSW = pd.read_csv('/home/sec/Desktop/CEFET/tudao.csv')

previsores = UNSW.iloc[:, UNSW.columns.isin((
    'sload', 'dload', 'spkts', 'dpkts', 'swin', 'dwin', 'smean', 'dmean',
    'sjit', 'djit', 'sinpkt', 'dinpkt', 'tcprtt', 'synack', 'ackdat',
    'ct_srv_src', 'ct_srv_dst', 'ct_dst_ltm', 'ct_src_ltm',
    'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm'))].values


classe= UNSW.iloc[:, -1].values


X_train, X_test, y_train, y_test = model_selection.train_test_split(
previsores, classe, test_size=0.4, random_state=0)

# Shapes depend on the CSV: 60% of the rows go to training, 40% to testing.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
print(previsores.shape)


########K FOLD
print('########K FOLD########K FOLD########K FOLD########K FOLD')
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

kf = KFold(n_splits=30, random_state=None, shuffle=False)
kf.get_n_splits(previsores)
for train_index, test_index in kf.split(previsores):

    X_train, X_test = previsores[train_index], previsores[test_index]
    y_train, y_test = classe[train_index], classe[test_index]

    logmodel.fit(X_train, y_train)
    print (confusion_matrix(y_test, logmodel.predict(X_test)))
print(10* '#')


Recommended answer

For accuracy, I would use the function cross_val_score, which does exactly what you are looking for. Called with cv=30 (or cv=kf), it returns an array of 30 validation accuracies, one per fold. You can then compute their mean, standard deviation, etc., and build a rough confidence interval (mean ± 2*std).
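The suggestion above could look roughly like the following sketch. It is not the asker's exact setup: the data is a synthetic stand-in generated with make_classification (the original CSV is not available here), and shuffle=True, random_state=0 and max_iter=1000 are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the question's data (assumption, not the real CSV).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=30, shuffle=True, random_state=0)

# One accuracy value per fold, 30 in total.
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")

mean_acc = scores.mean()
std_acc = scores.std(ddof=1)  # sample standard deviation across folds

# Rough interval as suggested in the answer: mean +/- 2 * std.
lower = mean_acc - 2 * std_acc
upper = mean_acc + 2 * std_acc
print(f"accuracy: {mean_acc:.3f}, rough interval: [{lower:.3f}, {upper:.3f}]")
```

With 30 small folds the per-fold accuracies are noisy, so the interval describes the spread across folds rather than a strict statistical guarantee, and its upper bound can exceed 1.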

Since a confusion matrix cannot be treated as a single performance metric (it is a matrix, not one number), I would recommend creating a list and appending each fold's validation confusion matrix to it (currently you only print it). At the end, you can use this list to extract a lot of useful information.

Update:

...
...
# Collect one confusion matrix per fold instead of printing it.
cm_holder = []
for train_index, test_index in kf.split(previsores):
    X_train, X_test = previsores[train_index], previsores[test_index]
    y_train, y_test = classe[train_index], classe[test_index]

    logmodel.fit(X_train, y_train)
    cm_holder.append(confusion_matrix(y_test, logmodel.predict(X_test)))

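Note that len(cm_holder) = 30 and each element is an array of shape (n_classes, n_classes), provided every test fold contains all classes in its true or predicted labels. If a fold may miss a class, pass labels=... to confusion_matrix so that every matrix has the same shape.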
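As one illustration of what can be extracted from such a list, the sketch below pools the per-fold matrices into a single confusion matrix and recovers per-fold accuracy from them. It again uses synthetic data from make_classification as a stand-in for the question's CSV; shuffle=True, random_state=0, max_iter=1000 and the variable names cm_stack, cm_total and fold_acc are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

# Synthetic stand-in for the question's data (assumption, not the real CSV).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
labels = np.unique(y)  # fixed label order so every fold gives the same shape

model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=30, shuffle=True, random_state=0)

cm_holder = []
for train_index, test_index in kf.split(X):
    model.fit(X[train_index], y[train_index])
    y_pred = model.predict(X[test_index])
    cm_holder.append(confusion_matrix(y[test_index], y_pred, labels=labels))

cm_stack = np.stack(cm_holder)   # shape (30, n_classes, n_classes)
cm_total = cm_stack.sum(axis=0)  # pooled confusion matrix over all folds

# Per-fold accuracy: correct predictions (diagonal) / samples in the fold.
fold_acc = np.trace(cm_stack, axis1=1, axis2=2) / cm_stack.sum(axis=(1, 2))
# Overall accuracy from the pooled matrix.
overall_acc = np.trace(cm_total) / cm_total.sum()

print(cm_total)
print(f"overall accuracy: {overall_acc:.3f}, std across folds: {fold_acc.std(ddof=1):.3f}")
```

Because every sample appears in exactly one test fold, the pooled matrix counts each sample once, and fold_acc can feed the same mean ± 2*std interval mentioned in the answer.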
