Feature selection + cross-validation, but how to make ROC-curves in R


Question

I'm stuck with the following problem. I divide my data into 10 folds. Each time, I use 1 fold as the test set and the other 9 as the training set (I do this ten times). On each training set, I do feature selection (a filter method with chi.squared) and then build an SVM model with that training set and the selected features.

So at the end, I end up with 10 different models (because of the feature selection). But now I want to make a ROC curve in R for this filter method in general. How can I do this?

Silke
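
For illustration, the filter step inside a single fold might look like the sketch below. The FSelector package for the chi-squared ranking and the aSAH data from pROC (the same data used in the answer below) are assumptions made for the example only, since the question names neither a package nor a dataset:

library(pROC)        # only for the aSAH data, used here as a stand-in for the real data
library(FSelector)   # chi.squared(), cutoff.k(), as.simple.formula() -- assumed filter implementation

data(aSAH)
# stand-in for the 9 training folds of one iteration; gos6 is omitted because it is closely tied to the outcome
learn <- aSAH[1:80, c("outcome", "gender", "age", "wfns", "s100b", "ndka")]

weights  <- chi.squared(outcome ~ ., data = learn)   # chi-squared score for every candidate feature
selected <- cutoff.k(weights, 3)                     # keep the 3 best-ranked features (arbitrary choice)
f <- as.simple.formula(selected, "outcome")          # formula restricted to the selected features

The SVM for that fold would then be trained on learn using only the features in f, exactly as described above.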

Answer

You can indeed store the predictions if they are all on the same scale (be especially careful about this as you perform feature selection... some methods may produce scores that depend on the number of features) and use them to build a ROC curve. Here is the code I used for a recent paper:

library(pROC)
data(aSAH)

k <- 10                                          # number of cross-validation folds
n <- dim(aSAH)[1]
indices <- sample(rep(1:k, ceiling(n/k))[1:n])   # random fold assignment for every observation

all.response <- all.predictor <- aucs <- c()
for (i in 1:k) {
  test  <- aSAH[indices == i, ]                  # fold i is the test set
  learn <- aSAH[indices != i, ]                  # the remaining folds form the training set
  model <- glm(as.numeric(outcome) - 1 ~ s100b + ndka + as.numeric(wfns),
               data = learn, family = binomial(link = "logit"))
  model.pred <- predict(model, newdata = test)   # linear predictor (log-odds): same scale in every fold
  aucs <- c(aucs, roc(test$outcome, model.pred)$auc)   # fold-specific AUC
  all.response <- c(all.response, test$outcome)        # pool the responses ...
  all.predictor <- c(all.predictor, model.pred)        # ... and the predictions across folds
}

roc(all.response, all.predictor)   # ROC curve from the pooled out-of-fold predictions
mean(aucs)                         # average of the 10 fold-specific AUCs
The ROC curve is built from all.response and all.predictor, which are updated at each step. This code also stores the AUC of each fold in aucs for comparison. Both results should be quite similar when the sample size is sufficiently large; however, small samples within the cross-validation may lead to underestimated AUCs, as the ROC curve built with all the data will tend to be smoother and less underestimated by the trapezoidal rule.
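
Applied to the setting in the question, the same skeleton might look roughly as follows. The chi-squared filter from FSelector and the SVM from e1071 are assumptions (the question does not say which packages are used), and the class probabilities returned by predict() provide a score on the same 0-1 scale in every fold:

library(pROC)
library(FSelector)   # chi.squared(), cutoff.k(), as.simple.formula()
library(e1071)       # svm()

data(aSAH)
dat <- aSAH[, c("outcome", "gender", "age", "wfns", "s100b", "ndka")]   # aSAH as a stand-in for the real data

k <- 10
n <- nrow(dat)
indices <- sample(rep(1:k, ceiling(n/k))[1:n])

all.response <- all.predictor <- aucs <- c()
for (i in 1:k) {
  test  <- dat[indices == i, ]
  learn <- dat[indices != i, ]
  weights  <- chi.squared(outcome ~ ., data = learn)   # feature selection on the training folds only
  selected <- cutoff.k(weights, 3)                     # keep the 3 best-ranked features (arbitrary choice)
  f <- as.simple.formula(selected, "outcome")
  model <- svm(f, data = learn, probability = TRUE)
  pred  <- predict(model, newdata = test, probability = TRUE)
  score <- attr(pred, "probabilities")[, "Poor"]       # P(outcome == "Poor"), comparable across folds
  aucs <- c(aucs, roc(test$outcome, score)$auc)
  all.response  <- c(all.response, test$outcome)
  all.predictor <- c(all.predictor, score)
}

roc(all.response, all.predictor)   # one ROC curve for the whole filter + SVM procedure
mean(aucs)

Even though a different SVM (and possibly a different feature subset) is fitted in each fold, every observation receives exactly one out-of-fold score, so the pooled scores define a single ROC curve for the procedure as a whole.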
