How to plot a ROC curve from Classification Tree probabilities
Question
I am trying to plot a ROC curve from the class probabilities of a classification tree. However, when I plot the curve, it is absent. I want to plot the ROC curve and then find the AUC value from the area under it. Does anyone know how to fix this? Thank you if you can. The binary column Risk stands for risk of misclassification, which I presume is my label. Should I be applying the ROC curve code at a different point in my script?
Here is the code and the data frame:
library(ROCR)
data(Risk.table)
pred = prediction(Risk.table$Predicted.prob, Risk.table2$Risk)
perf = performance(pred, measure="tpr", x.measure="fpr")
perf
plot(perf)
Predicted.prob Actual.prob predicted actual Risk
1 0.5384615 0.4615385 G8 V4 0
2 0.1212121 0.8787879 V4 V4 1
3 0.5384615 0.4615385 G8 G8 1
4 0.9000000 0.1000000 G8 G8 1
5 0.1212121 0.8787879 V4 V4 1
6 0.1212121 0.8787879 V4 V4 1
7 0.9000000 0.1000000 G8 G8 1
8 0.5384615 0.4615385 G8 V4 0
9 0.5384615 0.4615385 G8 V4 0
10 0.1212121 0.8787879 V4 G8 0
11 0.1212121 0.8787879 V4 V4 1
12 0.9000000 0.1000000 G8 V4 0
13 0.9000000 0.1000000 G8 V4 0
14 0.1212121 0.8787879 G8 V4 1
15 0.9000000 0.1000000 G8 G8 1
16 0.5384615 0.4615385 G8 V4 0
17 0.9000000 0.1000000 G8 V4 0
18 0.1212121 0.8787879 V4 V4 1
19 0.5384615 0.4615385 G8 V4 0
20 0.1212121 0.8787879 V4 V4 1
21 0.9000000 0.1000000 G8 G8 1
22 0.5384615 0.4615385 G8 V4 0
23 0.9000000 0.1000000 G8 V4 0
24 0.1212121 0.8787879 V4 V4 1
Here is the ROC curve this code outputs, but the curve is missing:
#Split data 70:30 after shuffling the data frame
index<-1:nrow(LDA.scores1)
trainindex.LDA3=sample(index, trunc(length(index)*0.70),replace=FALSE)
LDA.70.trainset3<-shuffle.cross.validation2[trainindex.LDA3,]
LDA.30.testset3<-shuffle.cross.validation2[-trainindex.LDA3,]
Run the classification tree using the rpart package
tree.split3<-rpart(Family~., data=LDA.70.trainset3, method="class")
tree.split3
summary(tree.split3)
print(tree.split3)
plot(tree.split3)
text(tree.split3,use.n=T,digits=0)
printcp(tree.split3)
tree.split3
Generate predictions for the test set
res3=predict(tree.split3,newdata=LDA.30.testset3)
res4=as.data.frame(res3)
Create two columns filled with NA (actual and predicted classification)
res4$predicted<-NA
res4$actual<-NA
for (i in 1:length(res4$G8)){
if(res4$R2[i]>res4$V4[i]) {
res4$predicted[i]<-"G8"
}
else {
res4$predicted[i]<-"V4"
}
print(i)
}
res4
res4$actual<-LDA.30.testset3$Family
res4
Risk.table$Risk<-NA
Risk.table
Create the binary predictor column
for (i in 1:length(Risk.table$Risk)){
if(Risk.table$predicted[i]==res4$actual[i]) {
Risk.table$Risk[i]<-1
}
else {
Risk.table$Risk[i]<-0
}
print(i)
}
Creation of the predicted and actual probabilities for the two families V4 and G8 above
#Confusion Matrix
cm=table(res4$actual, res4$predicted)
names(dimnames(cm))=c("actual", "predicted")
Naive Bayes
index<-1:nrow(significant.lda.Wilks2)
trainindex.LDA.help1=sample(index, trunc(length(index)*0.70), replace=FALSE)
sig.train=significant.lda.Wilks2[trainindex.LDA.help1,]
sig.test=significant.lda.Wilks2[-trainindex.LDA.help1,]
library(klaR)
nbmodel<-NaiveBayes(Family~., data=sig.train)
prediction<-predict(nbmodel, sig.test)
NB<-as.data.frame(prediction)
colnames(NB)<-c("Actual", "Predicted.prob", "acual.prob")
NB$actual2 = NA
NB$actual2[NB$Actual=="G8"] = 1
NB$actual2[NB$Actual=="V4"] = 0
NB2<-as.data.frame(NB)
plot(fit.perf, col="red"); #Naive Bayes
plot(perf, col="blue", add=T); #Classification Tree
abline(0,1,col="green")
library(caret)
library(e1071)
train_control<-trainControl(method="repeatedcv", number=10, repeats=3)
model<-train(Matriline~., data=LDA.scores, trControl=train_control, method="nb")
predictions <- predict(model, LDA.scores[,2:13])
confusionMatrix(predictions,LDA.scores$Family)
Results
Confusion Matrix and Statistics
Reference
Prediction V4 G8
V4 25 2
G8 5 48
Accuracy : 0.9125
95% CI : (0.828, 0.9641)
No Information Rate : 0.625
P-Value [Acc > NIR] : 4.918e-09
Kappa : 0.8095
Mcnemar's Test P-Value : 0.4497
Sensitivity : 0.8333
Specificity : 0.9600
Pos Pred Value : 0.9259
Neg Pred Value : 0.9057
Prevalence : 0.3750
Detection Rate : 0.3125
Detection Prevalence : 0.3375
Balanced Accuracy : 0.8967
'Positive' Class : V4
Recommended answer
I have various things to point out:
1) I think the formula inside your rpart command has to be Family ~ .
2) In your initial table I can see a value W3 in your predicted column. Does that mean you don't have a binary dependent variable? ROC curves work with binary data, so check it.
3) Your predicted and actual probabilities in your initial table always sum to 1. Is that reasonable? I think they represent something else, so you might consider renaming them in case they confuse you in the future.
4) I think you're confused about how ROC works and what inputs it needs. Your Risk column uses 1 to represent a correct prediction and 0 to represent a wrong prediction. However, the ROC curve needs 1 to represent one class and 0 to represent the other class. In simple words, the command is prediction(predictions, labels), where predictions are your predicted probabilities and labels are the true classes/levels of your dependent variable.
Check the following code:
dt = read.table(text="
Id Predicted.prob Actual.prob predicted actual Risk
1 0.5384615 0.4615385 G8 V4 0
2 0.1212121 0.8787879 V4 V4 1
3 0.5384615 0.4615385 G8 G8 1
4 0.9000000 0.1000000 G8 G8 1
5 0.1212121 0.8787879 V4 V4 1
6 0.1212121 0.8787879 V4 V4 1
7 0.9000000 0.1000000 G8 G8 1
8 0.5384615 0.4615385 G8 V4 0
9 0.5384615 0.4615385 G8 V4 0
10 0.1212121 0.8787879 V4 G8 0
11 0.1212121 0.8787879 V4 V4 1
12 0.9000000 0.1000000 G8 V4 0
13 0.9000000 0.1000000 G8 V4 0
14 0.1212121 0.8787879 W3 V4 1
15 0.9000000 0.1000000 G8 G8 1
16 0.5384615 0.4615385 G8 V4 0
17 0.9000000 0.1000000 G8 V4 0
18 0.1212121 0.8787879 V4 V4 1
19 0.5384615 0.4615385 G8 V4 0
20 0.1212121 0.8787879 V4 V4 1
21 0.9000000 0.1000000 G8 G8 1
22 0.5384615 0.4615385 G8 V4 0
23 0.9000000 0.1000000 G8 V4 0
24 0.1212121 0.8787879 V4 V4 1", header=T)
library(ROCR)
roc_pred <- prediction(dt$Predicted.prob, dt$Risk)
perf <- performance(roc_pred, "tpr", "fpr")
plot(perf, col="red")
abline(0,1,col="grey")
The ROC curve is:
When you create a new column actual2 where you have 1 instead of G8 and 0 instead of V4:
dt$actual2 = NA
dt$actual2[dt$actual=="G8"] = 1
dt$actual2[dt$actual=="V4"] = 0
roc_pred <- prediction(dt$Predicted.prob, dt$actual2)
perf <- performance(roc_pred, "tpr", "fpr")
plot(perf, col="red")
abline(0,1,col="grey")
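Since the question also asks for the AUC, here is a minimal sketch of one way to get it with ROCR, using the "auc" measure of performance(). It assumes that Predicted.prob is the predicted probability of class G8 (an assumption; the table does not say so explicitly), and the data frame name roc_dt is a hypothetical stand-in that re-enters the 24 rows of the table above.

```r
library(ROCR)

# Hypothetical stand-in: the 24 rows of the table above, entered as vectors
roc_dt <- data.frame(
  Predicted.prob = c(0.5384615, 0.1212121, 0.5384615, 0.9, 0.1212121, 0.1212121,
                     0.9, 0.5384615, 0.5384615, 0.1212121, 0.1212121, 0.9,
                     0.9, 0.1212121, 0.9, 0.5384615, 0.9, 0.1212121,
                     0.5384615, 0.1212121, 0.9, 0.5384615, 0.9, 0.1212121),
  actual = c("V4", "V4", "G8", "G8", "V4", "V4", "G8", "V4", "V4", "G8", "V4", "V4",
             "V4", "V4", "G8", "V4", "V4", "V4", "V4", "V4", "G8", "V4", "V4", "V4")
)

# Labels must be the true class (1 = G8, 0 = V4), not "was the prediction correct"
roc_dt$actual2 <- as.integer(roc_dt$actual == "G8")

# Assumption: Predicted.prob is the probability of the positive class G8
roc_pred <- prediction(roc_dt$Predicted.prob, roc_dt$actual2)

# The "auc" measure returns a performance object; the value sits in y.values
auc <- performance(roc_pred, measure = "auc")@y.values[[1]]
print(auc)
```

An AUC near 0.5 would mean the probabilities rank the two classes no better than chance, which is what the grey diagonal in the plots represents.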
5) As @eipi10 mentioned above, you should try to get rid of the for loops in your code.
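As an illustration of point 5, the two for loops can probably be replaced with vectorised comparisons. The sketch below uses a small made-up stand-in for res4 (three hypothetical rows with class-probability columns G8 and V4) rather than the real predict() output, so the values are illustrative only.

```r
# Hypothetical stand-in for res4: one probability column per class
res4 <- data.frame(G8 = c(0.54, 0.12, 0.90),
                   V4 = c(0.46, 0.88, 0.10))
actual <- c("V4", "V4", "G8")  # hypothetical true classes of the test set

# Replaces the first loop: pick the class with the higher probability
res4$predicted <- ifelse(res4$G8 > res4$V4, "G8", "V4")
res4$actual <- actual

# Replaces the second loop: 1 where the prediction is correct, 0 otherwise
res4$Risk <- as.integer(res4$predicted == res4$actual)

print(res4)
```

Keep in mind that Risk built this way still only says whether a prediction was correct, so it should not be passed as the labels argument of prediction(), as explained in point 4.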