Plotting Precision-Recall curve when using cross-validation in scikit-learn


Question

I'm using cross-validation to evaluate the performance of a classifier with scikit-learn, and I want to plot the Precision-Recall curve. I found an example on scikit-learn's website for plotting the PR curve, but it doesn't use cross-validation for the evaluation.

How can I plot the Precision-Recall curve in scikit-learn when using cross-validation?

I did the following, but I'm not sure if it's the correct way to do it (pseudo-code):

for each k-fold:
    precision, recall, _ = precision_recall_curve(y_test, probs)
    mean_precision += precision
    mean_recall += recall

mean_precision /= num_folds
mean_recall /= num_folds

plt.plot(mean_recall, mean_precision)

What do you think?

It doesn't work, because the precision and recall arrays have different sizes after each fold.
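A small illustration of that failure mode (the labels and scores below are made-up toy values): `precision_recall_curve` returns one point per distinct decision threshold, so the array lengths depend on the scores in each fold and element-wise averaging breaks.

```python
from sklearn.metrics import precision_recall_curve

# "Fold 1": several distinct scores -> more thresholds, longer arrays
p1, r1, _ = precision_recall_curve([0, 1, 1, 1], [0.1, 0.3, 0.6, 0.9])

# "Fold 2": tied scores -> fewer thresholds, shorter arrays
p2, r2, _ = precision_recall_curve([0, 1, 1, 1], [0.2, 0.8, 0.8, 0.8])

# len(p1) != len(p2), so mean_precision += precision raises a shape error
```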

Anybody?

Answer

Instead of recording the precision and recall values after each fold, store the predictions on the test samples after each fold. Next, collect all the test (i.e. out-of-fold) predictions and compute precision and recall once over the whole set.

 ## let test_samples[k] = test samples for the kth fold (list of lists)
 ## let train_samples[k] = training samples for the kth fold (list of lists)

 for k in range(num_folds):
      model = train(parameters, train_samples[k])
      predictions_fold[k] = predict(model, test_samples[k])

 # collect predictions
 predictions_combined = [p for preds in predictions_fold for p in preds]

 ## let predictions = rearranged predictions s.t. they are in the original order

 ## use predictions and labels to compute lists of TP, FP, FN
 ## use TP, FP, FN to compute precisions and recalls for one run of k-fold cross-validation
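The out-of-fold scheme above can be sketched with scikit-learn's `cross_val_predict`, which takes care of the bookkeeping (predicting each sample exactly once and restoring the original order). The classifier and toy dataset here are illustrative assumptions, not part of the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One out-of-fold probability per sample: each prediction comes from a model
# that never saw that sample during training.
probas = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

# A single PR curve over all n out-of-fold predictions.
precision, recall, _ = precision_recall_curve(y, probas)
# plt.plot(recall, precision) would then draw the curve.
```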

In a single, complete run of k-fold cross-validation, the predictor makes exactly one prediction for each sample. Given n samples, you should have n test predictions.

(Note: these predictions are different from training predictions, because the predictor makes each prediction without having previously seen the sample.)

Unless you are using leave-one-out cross-validation, k-fold cross-validation generally requires a random partitioning of the data. Ideally, you would do repeated (and stratified) k-fold cross-validation. Combining precision-recall curves from different rounds, however, is not straightforward, since you cannot use simple linear interpolation between precision-recall points, unlike ROC (see Davis and Goadrich 2006).

I personally calculated AUC-PR using the Davis-Goadrich method for interpolation in PR space (followed by numerical integration) and compared the classifiers using the AUC-PR estimates from repeated stratified 10-fold cross-validation.
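scikit-learn does not ship the Davis-Goadrich PR interpolation, but `average_precision_score` gives a step-wise (non-interpolated) AUC-PR summary that is commonly used instead. A hedged sketch of the repeated stratified 10-fold estimate, with an assumed classifier and toy data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=300, weights=[0.7, 0.3], random_state=0)
clf = LogisticRegression(max_iter=1000)

scores = []
for rep in range(5):  # 5 repetitions of stratified 10-fold CV
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
    probas = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
    scores.append(average_precision_score(y, probas))  # one AUC-PR per repetition

mean_ap, std_ap = np.mean(scores), np.std(scores)
```

Comparing classifiers on `mean_ap` (with `std_ap` as a stability indicator) mirrors the approach described above, just with a step-wise rather than Davis-Goadrich AUC-PR.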

For a nice plot, I showed a representative PR curve from one of the cross-validation rounds.

There are, of course, many other ways of assessing classifier performance, depending on the nature of your dataset.

For instance, if the proportion of (binary) labels in your dataset is not skewed (i.e. it is roughly 50-50), you could use the simpler ROC analysis with cross-validation:

Collect predictions from each fold and construct ROC curves (as before), collect all the TPR-FPR points (i.e. take the union of all TPR-FPR tuples), then plot the combined set of points, possibly with smoothing. Optionally, compute AUC-ROC using simple linear interpolation and the composite trapezoidal rule for numerical integration.
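The ROC variant can be sketched the same way (classifier and data are again illustrative assumptions): collect out-of-fold scores, build one ROC curve, and integrate it with the trapezoidal rule, which is exactly what scikit-learn's `auc` helper does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold scores, one per sample.
probas = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

fpr, tpr, _ = roc_curve(y, probas)
# auc() integrates with the composite trapezoidal rule, i.e. linear
# interpolation between ROC points -- valid for ROC, unlike for PR curves.
roc_auc = auc(fpr, tpr)
# plt.plot(fpr, tpr) would then draw the combined ROC curve.
```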

