Implementing ROC curves for the k-NN machine learning algorithm using Python and scikit-learn


Problem description


I am currently trying to implement an ROC curve for my kNN classification algorithm. I know that an ROC curve plots the true positive rate against the false positive rate, but I am struggling to obtain those values from my dataset. I import 'autoimmune.csv' into my Python script and run the kNN algorithm on it to output an accuracy value. The scikit-learn.org documentation shows that to generate the TPR and FPR I need to pass in y_test and y_scores, as shown below:

fpr, tpr, threshold = roc_curve(y_test, y_scores)

I am just struggling with what I should be using as these values. Thanks in advance for your help, and apologies if I have missed something, as this is my first post here.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('./autoimmune.csv')
X = data.drop(columns=['autoimmune'])
y = data['autoimmune'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

knn = KNeighborsClassifier(n_neighbors = 10)
knn.fit(X_train,y_train)
knn.predict(X_test)[0:10]
knn.score(X_test,y_test)

print("Test set score: {:.4f}".format(knn.score(X_test, y_test)))

knn_cv = KNeighborsClassifier(n_neighbors=10)
cv_scores = cross_val_score(knn_cv, X, y, cv=10)
print(cv_scores)
print('cv_scores mean:{}' .format(np.mean(cv_scores)))


y_scores = cross_val_score(knn_cv, X, y, cv=76)
fpr, tpr, threshold = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)
print(roc_auc)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve of kNN')
plt.show()

Solution

If you look at the documentation for roc_curve(), you will see the following regarding the y_score parameter:

y_score : array, shape = [n_samples] Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by "decision_function" on some classifiers).
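
To make the shape of that argument concrete, here is a small sketch with hand-written values. The four labels and scores are made up for illustration and do not come from any dataset; the point is only that roc_curve() expects one continuous score per sample, aligned with the true labels.

```python
from sklearn.metrics import roc_curve, auc

# Hypothetical toy data: one true label and one score per sample.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # higher means "more likely positive"

# roc_curve sweeps a threshold over the scores and returns one
# (fpr, tpr) point per threshold it keeps.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
toy_auc = auc(fpr, tpr)

print(fpr)
print(tpr)
print(toy_auc)  # 0.75: three of the four (positive, negative) pairs are ranked correctly
```

Hard 0/1 predictions would also be accepted here, but they give only a single operating point, which is why a probability or confidence value is preferred.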

You can get probability estimates using the predict_proba() method of the KNeighborsClassifier in sklearn. This returns a numpy array with two columns for a binary classification, one each for the negative and positive class. For the roc_curve() function you want to use probability estimates of the positive class, so you can replace your:

y_scores = cross_val_score(knn_cv, X, y, cv=76)
fpr, tpr, threshold = roc_curve(y_test, y_scores)

with:

y_scores = knn.predict_proba(X_test)
fpr, tpr, threshold = roc_curve(y_test, y_scores[:, 1])

Notice how you need to take all the rows of the second column with [:, 1] to only select the probability estimates of the positive class. Here's a minimal reproducible example using the Wisconsin breast cancer dataset, since I don't have your autoimmune.csv:

from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

knn = KNeighborsClassifier(n_neighbors = 10)
knn.fit(X_train,y_train)

y_scores = knn.predict_proba(X_test)
fpr, tpr, threshold = roc_curve(y_test, y_scores[:, 1])
roc_auc = auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve of kNN')
plt.show()

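This produces the following ROC curve: [figure: "ROC Curve of kNN", true positive rate against false positive rate, with the AUC shown in the legend]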
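
As a side note on why the original attempt failed: cross_val_score() returns one accuracy value per fold (76 values with cv=76), not one score per test sample, so its output cannot be paired with y_test. If you do want a cross-validated ROC curve rather than one from a single train/test split, one possible approach is cross_val_predict() with method="predict_proba", which returns an out-of-fold probability for every sample. The following is a sketch under assumptions: it reuses the breast cancer data from above, and the 10 stratified folds and random_state=0 are arbitrary choices of mine, not something from the question.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
knn_cv = KNeighborsClassifier(n_neighbors=10)

# Assumed setup: 10 stratified, shuffled folds with a fixed seed.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Each row is predicted by a model that did not see that row during fitting.
proba = cross_val_predict(knn_cv, X, y, cv=cv, method="predict_proba")

# Columns follow the sorted class labels, so column 1 belongs to class 1.
y_scores = proba[:, 1]

# The scores now cover the whole dataset, so compare against y, not y_test.
fpr, tpr, thresholds = roc_curve(y, y_scores)
cv_auc = roc_auc_score(y, y_scores)
print("Cross-validated AUC: {:.4f}".format(cv_auc))
```

The resulting fpr and tpr arrays can be passed to the same plotting code as above. With your own data, check knn.classes_ after fitting to confirm which column of predict_proba() corresponds to the positive class.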
