How to plot a ROC curve from Classification Tree probabilities


Question

I am attempting to plot a ROC curve from classification tree probabilities. However, when I plot the curve, it is absent. I am trying to plot the ROC curve and then find the AUC value from the area under the curve. Does anyone know how to fix this? The binary column Risk stands for risk misclassification, which I presume is my label. Should I be applying the ROC curve equation at a different point in my code?

Here is the data frame:

   library(ROCR)

   data(Risk.table)

   pred = prediction(Risk.table$Predicted.prob, Risk.table2$Risk)
   perf = performance(pred, measure="tpr", x.measure="fpr")
   perf
   plot(perf)

   Predicted.prob Actual.prob   predicted actual Risk
  1       0.5384615   0.4615385        G8     V4    0
  2       0.1212121   0.8787879        V4     V4    1
  3       0.5384615   0.4615385        G8     G8    1
  4       0.9000000   0.1000000        G8     G8    1
  5       0.1212121   0.8787879        V4     V4    1
  6       0.1212121   0.8787879        V4     V4    1
  7       0.9000000   0.1000000        G8     G8    1
  8       0.5384615   0.4615385        G8     V4    0
  9       0.5384615   0.4615385        G8     V4    0
  10      0.1212121   0.8787879        V4     G8    0
  11      0.1212121   0.8787879        V4     V4    1
  12      0.9000000   0.1000000        G8     V4    0
  13      0.9000000   0.1000000        G8     V4    0
  14      0.1212121   0.8787879        G8     V4    1
  15      0.9000000   0.1000000        G8     G8    1
  16      0.5384615   0.4615385        G8     V4    0
  17      0.9000000   0.1000000        G8     V4    0
  18      0.1212121   0.8787879        V4     V4    1
  19      0.5384615   0.4615385        G8     V4    0
  20      0.1212121   0.8787879        V4     V4    1
  21      0.9000000   0.1000000        G8     G8    1
  22      0.5384615   0.4615385        G8     V4    0
  23      0.9000000   0.1000000        G8     V4    0
  24      0.1212121   0.8787879        V4     V4    1



Here is the ROC curve this code outputs, but the curve is missing:

  #Split data 70:30 after shuffling the data frame

  index<-1:nrow(LDA.scores1)
  trainindex.LDA3=sample(index, trunc(length(index)*0.70),replace=FALSE)      

  LDA.70.trainset3<-shuffle.cross.validation2[trainindex.LDA3,]

  LDA.30.testset3<-shuffle.cross.validation2[-trainindex.LDA3,]



Run the classification tree using the rpart package

 tree.split3<-rpart(Family~., data=LDA.70.trainset3, method="class")
 tree.split3
 summary(tree.split3)
 print(tree.split3)
 plot(tree.split3)
 text(tree.split3,use.n=T,digits=0)
 printcp(tree.split3)
 tree.split3



Predict the class probabilities for the test set

 res3=predict(tree.split3,newdata=LDA.30.testset3)
 res4=as.data.frame(res3)



Create two columns filled with NA (actual and predicted classification)

 res4$predicted<-NA
 res4$actual<-NA


 for (i in 1:length(res4$G8)){

 if(res4$R2[i]>res4$V4[i]) {
 res4$predicted[i]<-"G8"
 }

 else {
 res4$predicted[i]<-"V4"
 }

  print(i)
 }

 res4

 res4$actual<-LDA.30.testset3$Family
 res4
 Risk.table$Risk<-NA
 Risk.table



Create the binary predictor column

  for (i in 1:length(Risk.table$Risk)){

  if(Risk.table$predicted[i]==res4$actual[i]) {
  Risk.table$Risk[i]<-1
  }

  else {
  Risk.table$Risk[i]<-0
  }

  print(i)
  }



Creation of the predicted and actual probabilities for the two families V4 and G8 above

    #Confusion Matrix

    cm=table(res4$actual, res4$predicted)

    names(dimnames(cm))=c("actual", "predicted")



Naive Bayes

  index<-1:nrow(significant.lda.Wilks2)
  trainindex.LDA.help1=sample(index, trunc(length(index)*0.70), replace=FALSE)                                     
  sig.train=significant.lda.Wilks2[trainindex.LDA.help1,]
  sig.test=significant.lda.Wilks2[-trainindex.LDA.help1,]


    library(klaR)
    nbmodel<-NaiveBayes(Family~., data=sig.train)
    prediction<-predict(nbmodel, sig.test)
    NB<-as.data.frame(prediction)
    colnames(NB)<-c("Actual", "Predicted.prob", "acual.prob")

    NB$actual2 = NA
    NB$actual2[NB$Actual=="G8"] = 1
    NB$actual2[NB$Actual=="V4"] = 0
    NB2<-as.data.frame(NB)

    plot(fit.perf, col="red"); #Naive Bayes
    plot(perf, col="blue", add=T); #Classification Tree
    abline(0,1,col="green")

     library(caret)
     library(e1071)

  train_control<-trainControl(method="repeatedcv", number=10, repeats=3)
  model<-train(Matriline~., data=LDA.scores, trControl=train_control,    method="nb")
  predictions <- predict(model, LDA.scores[,2:13])
  confusionMatrix(predictions,LDA.scores$Family)



Results

               Confusion Matrix and Statistics

                        Reference
                Prediction V4 G8
                        V4 25  2
                        G8  5 48

                  Accuracy : 0.9125         
                    95% CI : (0.828, 0.9641)
       No Information Rate : 0.625          
       P-Value [Acc > NIR] : 4.918e-09      

                    Kappa : 0.8095         
   Mcnemar's Test P-Value : 0.4497         

              Sensitivity : 0.8333         
              Specificity : 0.9600         
           Pos Pred Value : 0.9259         
           Neg Pred Value : 0.9057         
               Prevalence : 0.3750         
           Detection Rate : 0.3125         
     Detection Prevalence : 0.3375         
        Balanced Accuracy : 0.8967         

         'Positive' Class : V4         


Recommended answer

I have various things to point out:

1) I think your code has to be Family ~ . inside your rpart command.

2) In your initial table I can see a value W3 in your predicted column. Does that mean you don't have a binary dependent variable? ROC curves work with binary data, so check it.

3) Your predicted and actual probabilities in your initial table always sum to 1. Is that reasonable? I think they represent something else, so you might consider changing the names in case they confuse you in the future.

4) I think you're confused about how ROC works and what inputs it needs. Your Risk column uses 1 to represent a correct prediction and 0 to represent a wrong prediction. However, the ROC curve needs 1 to represent one class and 0 to represent the other class. In simple words, the command is prediction(predictions, labels), where predictions are your predicted probabilities and labels are the true classes/levels of your dependent variable. Check the following code, which uses Risk as the label:

dt = read.table(text="
Id Predicted.prob Actual.prob   predicted actual Risk
1       0.5384615   0.4615385        G8     V4    0
2       0.1212121   0.8787879        V4     V4    1
3       0.5384615   0.4615385        G8     G8    1
4       0.9000000   0.1000000        G8     G8    1
5       0.1212121   0.8787879        V4     V4    1
6       0.1212121   0.8787879        V4     V4    1
7       0.9000000   0.1000000        G8     G8    1
8       0.5384615   0.4615385        G8     V4    0
9       0.5384615   0.4615385        G8     V4    0
10      0.1212121   0.8787879        V4     G8    0
11      0.1212121   0.8787879        V4     V4    1
12      0.9000000   0.1000000        G8     V4    0
13      0.9000000   0.1000000        G8     V4    0
14      0.1212121   0.8787879        W3     V4    1
15      0.9000000   0.1000000        G8     G8    1
16      0.5384615   0.4615385        G8     V4    0
17      0.9000000   0.1000000        G8     V4    0
18      0.1212121   0.8787879        V4     V4    1
19      0.5384615   0.4615385        G8     V4    0
20      0.1212121   0.8787879        V4     V4    1
21      0.9000000   0.1000000        G8     G8    1
22      0.5384615   0.4615385        G8     V4    0
23      0.9000000   0.1000000        G8     V4    0
24      0.1212121   0.8787879        V4     V4    1", header=T)

library(ROCR)

roc_pred <- prediction(dt$Predicted.prob, dt$Risk)
perf <- performance(roc_pred, "tpr", "fpr")
plot(perf, col="red")
abline(0,1,col="grey")

This gives the ROC curve with Risk as the label (figure not preserved).

When you create a new column actual2 where you have 1 instead of G8 and 0 instead of V4, the curve is computed against the true class:

dt$actual2 = NA
dt$actual2[dt$actual=="G8"] = 1
dt$actual2[dt$actual=="V4"] = 0

roc_pred <- prediction(dt$Predicted.prob, dt$actual2)
perf <- performance(roc_pred, "tpr", "fpr")
plot(perf, col="red")
abline(0,1,col="grey")
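
Since the question also asks for the AUC, here is a minimal base-R sketch of how it could be computed from the same scores and labels. It assumes that Predicted.prob is the tree's probability for class G8 (the table suggests this, but the question does not state it), and the helper name auc_rank is hypothetical. With ROCR, performance(roc_pred, "auc")@y.values[[1]] should give the same quantity.

```r
# Scores and true classes copied from the table above (rows 1 to 24).
# Assumption: Predicted.prob is the predicted probability of class G8.
scores <- c(0.5384615, 0.1212121, 0.5384615, 0.9000000, 0.1212121, 0.1212121,
            0.9000000, 0.5384615, 0.5384615, 0.1212121, 0.1212121, 0.9000000,
            0.9000000, 0.1212121, 0.9000000, 0.5384615, 0.9000000, 0.1212121,
            0.5384615, 0.1212121, 0.9000000, 0.5384615, 0.9000000, 0.1212121)
actual <- c("V4", "V4", "G8", "G8", "V4", "V4", "G8", "V4", "V4", "G8", "V4", "V4",
            "V4", "V4", "G8", "V4", "V4", "V4", "V4", "V4", "G8", "V4", "V4", "V4")

# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
# Tied scores get average ranks, so each tied pair counts as half a win.
auc_rank <- function(scores, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  r <- rank(scores)  # ties.method = "average" is the default
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# The label is the true class (G8 as the positive class), not the Risk column.
auc <- auc_rank(scores, actual == "G8")
print(auc)
```

Because a single tree only produces a handful of distinct probabilities (three here), the ROC curve has just a few steps, and ties between the classes are counted as half.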

5) As @eipi10 mentioned above, you should try to get rid of the for loops in your code.
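
As a rough sketch of what removing the loops could look like, the two for loops in the question can be replaced with vectorised comparisons. The small res4 data frame and the actual vector below are illustrative stand-ins for the output of predict() and for LDA.30.testset3$Family; the column names G8 and V4 are assumed from the question.

```r
# Illustrative class probabilities standing in for the predict() output.
res4 <- data.frame(G8 = c(0.5384615, 0.1212121, 0.9000000),
                   V4 = c(0.4615385, 0.8787879, 0.1000000))
actual <- c("V4", "V4", "G8")  # stand-in for LDA.30.testset3$Family

# Predicted class: whichever of the two families has the higher probability.
res4$predicted <- ifelse(res4$G8 > res4$V4, "G8", "V4")
res4$actual <- actual

# 1 when the prediction matches the true class, 0 otherwise (the Risk column).
res4$Risk <- as.integer(res4$predicted == res4$actual)

# For ROC analysis the label should be the true class, not Risk.
res4$actual2 <- as.integer(res4$actual == "G8")
print(res4)
```

Each comparison works on whole columns at once, so there is no need to pre-fill columns with NA or to print the loop index.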
