Evaluating logistic regression with cross-validation

Problem description

I would like to use cross validation to test/train my dataset and evaluate the performance of the logistic regression model on the entire dataset and not only on the test set (e.g. 25%).

These concepts are totally new to me and I am not sure I am doing it right. I would be grateful if anyone could point out where I have gone wrong and advise me on the right steps to take. Part of my code is shown below.

Also, how can I plot the ROC curves for "y2" and "y3" on the same graph as the current one?

Thank you

import pandas as pd 
Data=pd.read_csv ('C:\Dataset.csv',index_col='SNo')
feature_cols=['A','B','C','D','E']
X=Data[feature_cols]

Y=Data['Status'] 
Y1=Data['Status1']  # predictions from elsewhere
Y2=Data['Status2'] # predictions from elsewhere

from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(X_train,y_train)

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn import metrics, cross_validation
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10)
metrics.accuracy_score(y, predicted) 

from sklearn.cross_validation import cross_val_score
accuracy = cross_val_score(logreg, X, y, cv=10,scoring='accuracy')
print (accuracy)
print (cross_val_score(logreg, X, y, cv=10,scoring='accuracy').mean())

from nltk import ConfusionMatrix 
print (ConfusionMatrix(list(y), list(predicted)))
#print (ConfusionMatrix(list(y), list(yexpert)))

# sensitivity:
print (metrics.recall_score(y, predicted) )

import matplotlib.pyplot as plt 
probs = logreg.predict_proba(X)[:, 1] 
plt.hist(probs) 
plt.show()

# use 0.5 cutoff for predicting 'default' 
import numpy as np 
preds = np.where(probs > 0.5, 1, 0) 
print (ConfusionMatrix(list(y), list(preds)))

# check accuracy, sensitivity, specificity 
print (metrics.accuracy_score(y, predicted)) 

#ROC CURVES and AUC 
# plot ROC curve 
fpr, tpr, thresholds = metrics.roc_curve(y, probs) 
plt.plot(fpr, tpr) 
plt.xlim([0.0, 1.0]) 
plt.ylim([0.0, 1.0]) 
plt.xlabel('False Positive Rate') 
plt.ylabel('True Positive Rate)') 
plt.show()

# calculate AUC 
print (metrics.roc_auc_score(y, probs))

# use AUC as evaluation metric for cross-validation 
from sklearn.cross_validation import cross_val_score 
logreg = LogisticRegression() 
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean() 

Solution

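You got it almost right. cross_validation.cross_val_predict gives you predictions for the entire dataset. You just need to remove the logreg.fit call earlier in the code, because cross_val_predict fits the model itself on each training split. Specifically, it divides your dataset into n folds and, in each iteration, holds one fold out as the test set and trains the model on the remaining n-1 folds. Every sample lands in the held-out fold exactly once, so in the end you get a prediction for every sample in the data. (In scikit-learn 0.18 and later these functions are imported from sklearn.model_selection; the sklearn.cross_validation module was removed in 0.20.)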

Let's illustrate this with one of the built-in datasets in sklearn, iris. This dataset contains 150 samples with 4 features each. iris['data'] is X and iris['target'] is y.

In [15]: iris['data'].shape
Out[15]: (150, 4)

To get predictions on the entire set with cross validation you can do the following:

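from sklearn.linear_model import LogisticRegression
from sklearn import datasets, metrics
# sklearn.cross_validation (used in the original answer) was removed in
# scikit-learn 0.20; cross_val_predict now lives in sklearn.model_selection.
from sklearn.model_selection import cross_val_predict

iris = datasets.load_iris()
# Out-of-fold predictions: each sample is predicted by a model trained without it.
predicted = cross_val_predict(LogisticRegression(), iris['data'], iris['target'], cv=10)
# The Out[...] values below are those reported in the original answer; exact numbers vary by version.
print(metrics.accuracy_score(iris['target'], predicted))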

Out [1] : 0.9537

print(metrics.classification_report(iris['target'], predicted))

Out [2] :
                     precision    recall  f1-score   support

                0       1.00      1.00      1.00        50
                1       0.96      0.90      0.93        50
                2       0.91      0.96      0.93        50

      avg / total       0.95      0.95      0.95       150
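
To make the fold mechanics described earlier concrete, here is a minimal sketch that builds the out-of-fold predictions by hand and compares them with what cross_val_predict returns. It assumes a recent scikit-learn (the sklearn.model_selection import path); the max_iter=1000 setting and the variable names manual and auto are my own choices for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = load_iris(return_X_y=True)

# max_iter is raised only to avoid convergence warnings (an assumption,
# not something the original answer sets).
model = LogisticRegression(max_iter=1000)

# For a classifier, cv=10 corresponds to a 10-fold stratified split.
cv = StratifiedKFold(n_splits=10)

# Fill in predictions fold by fold: train on 9 folds, predict the held-out one.
manual = np.empty_like(y)
for train_idx, test_idx in cv.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual[test_idx] = fold_model.predict(X[test_idx])

# The library helper performs the same loop internally.
auto = cross_val_predict(model, X, y, cv=cv)

print("manual and cross_val_predict agree:", np.array_equal(manual, auto))
print("out-of-fold accuracy:", np.mean(manual == y))
```

Because every index appears in exactly one test fold, manual ends up holding one prediction per sample, each made by a model that did not see that sample during training.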

So, back to your code. All you need is this:

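from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict  # sklearn.cross_validation before 0.18
# y is the label column, i.e. Data['Status'] (assigned to Y in the question).
logreg = LogisticRegression()
# No logreg.fit here: cross_val_predict fits a fresh copy on each training split.
predicted = cross_val_predict(logreg, X, y, cv=10)
print(metrics.accuracy_score(y, predicted))
print(metrics.classification_report(y, predicted))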
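
The question also computes probabilities with logreg.predict_proba(X), which scores the same rows the model was trained on. A cross-validated counterpart might look like the sketch below, which asks cross_val_predict for probabilities instead of labels via its method parameter. The data here is synthetic (make_classification stands in for the asker's five feature columns and binary status), and max_iter=1000 is my own setting, so treat this as an illustration rather than the asker's actual pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic binary data standing in for X (5 features) and y from the question.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

logreg = LogisticRegression(max_iter=1000)

# Out-of-fold probability of class 1: each row is scored by a model
# that was trained without that row.
probs = cross_val_predict(logreg, X, y, cv=10, method="predict_proba")[:, 1]

# Same 0.5 cutoff as in the question, applied to the out-of-fold probabilities.
preds = np.where(probs > 0.5, 1, 0)

print(confusion_matrix(y, preds))
print("accuracy:", accuracy_score(y, preds))
print("AUC:", roc_auc_score(y, probs))
```

The confusion matrix, accuracy, and AUC printed here are all based on predictions for rows the model did not train on, which is usually what "performance on the entire dataset" is meant to capture.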

For plotting ROC curves in multi-class classification, you can follow this tutorial.

In general, sklearn has very good tutorials and documentation. I strongly recommend reading their tutorial on cross_validation.
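
On the other part of the question, drawing several ROC curves on one graph, one possible approach is to compute each curve separately and plot them all on the same axes before calling plt.show() once. The sketch below assumes a binary target and assumes the "predictions from elsewhere" are numeric scores; status1 and status2 are simulated here and are purely hypothetical stand-ins for the Status1 and Status2 columns, as are the helper names roc_points and plot_rocs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

# Synthetic binary data standing in for the question's X and y.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Cross-validated probabilities for the logistic regression model.
probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=10, method="predict_proba"
)[:, 1]

# Hypothetical external scores (simulated as noisy copies of the labels).
rng = np.random.default_rng(0)
status1 = y + rng.normal(scale=0.6, size=len(y))
status2 = y + rng.normal(scale=1.0, size=len(y))


def roc_points(y_true, scores_by_name):
    """Return {name: (fpr, tpr, auc)} for every score vector."""
    curves = {}
    for name, scores in scores_by_name.items():
        fpr, tpr, _ = roc_curve(y_true, scores)
        curves[name] = (fpr, tpr, roc_auc_score(y_true, scores))
    return curves


def plot_rocs(curves):
    """Draw all curves on a single axes, then show the figure once."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without it

    fig, ax = plt.subplots()
    for name, (fpr, tpr, auc) in curves.items():
        ax.plot(fpr, tpr, label=f"{name} (AUC = {auc:.2f})")
    ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
    ax.set_xlim(0.0, 1.0)
    ax.set_ylim(0.0, 1.0)
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.legend(loc="lower right")
    plt.show()


curves = roc_points(
    y, {"logreg (10-fold CV)": probs, "Status1": status1, "Status2": status2}
)
for name, (_, _, auc) in curves.items():
    print(name, round(auc, 3))

# plot_rocs(curves)  # uncomment to draw all three curves on one figure
```

The key point is that every ax.plot call targets the same axes and plt.show() is called only once at the end. If the external predictions are hard 0/1 labels rather than scores, roc_curve still works, but each curve reduces to a single operating point joined to the corners.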
