Does anyone know how to generate the AUC/ROC area based on the prediction?


Question

I know the AUC/ROC area (http://weka.wikispaces.com/Area+under+the+curve) in WEKA is based on the Mann-Whitney statistic (http://en.wikipedia.org/wiki/Mann-Whitney_U).

But my question is this: if I have 10 labeled instances (Y or N, a binary target attribute) and apply an algorithm (e.g. J48) to the dataset, I get 10 predicted labels for these 10 instances. What exactly should I use to calculate AUC_Y, AUC_N, and AUC_Avg? The ranked predicted labels Y and N, or the actual labels (Y' and N')? Or do I need to calculate the TP rate and FP rate?

Can anyone give me a small example and point out which data I should use to calculate the AUC with the Mann-Whitney statistic approach? Thanks in advance.

Sample data:

inst#    actual predicted  error   PrY     PrN
1        1:y        1:y          *0.973   0.027
2        1:y        1:y          *0.999   0.001
3        2:n        1:y      +   *0.568   0.432
4        2:n        2:n           0.382  *0.618
5        1:y        2:n      +    0.421  *0.579
6        2:n        2:n           0.146  *0.854
7        1:y        1:y          *1       0    
8        1:y        1:y          *0.999   0.001
9        2:n        2:n           0.11   *0.89 
10       1:y        2:n      +    0.377  *0.623


Recommended answer

Calculating the AUC is based on ranking your results. I've just read up on the Mann-Whitney U statistic, and I think it is basically how I always do it in my code.

First, you need something to rank your results by. Usually this is the decision value of your classifier (e.g. the distance to the hyperplane for SVMs), but WEKA mostly uses the class probability. In your example, PrY and PrN sum to 1, which is good, so you can pick either one: sorting by PrY in descending order gives the same ranking as sorting by PrN in ascending order.

You then rank your instances by PrN in ascending order:

inst#    actual predicted  error   PrY     PrN
7        1:y        1:y          *1       0    
8        1:y        1:y          *0.999   0.001
2        1:y        1:y          *0.999   0.001
1        1:y        1:y          *0.973   0.027
3        2:n        1:y      +   *0.568   0.432
5        1:y        2:n      +    0.421  *0.579
4        2:n        2:n           0.382  *0.618
10       1:y        2:n      +    0.377  *0.623
6        2:n        2:n           0.146  *0.854
9        2:n        2:n           0.11   *0.89 
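
The ranking step can be reproduced with a few lines of Python. This is only a sketch: the `rows` list is a hand-typed copy of the sample data, and the tuple layout is an assumption made for illustration. Instances 2 and 8 have identical probabilities, so their relative order is arbitrary and does not affect the result.

```python
# Sample data from the question: (instance number, actual class, PrY, PrN).
rows = [
    (1, "y", 0.973, 0.027),
    (2, "y", 0.999, 0.001),
    (3, "n", 0.568, 0.432),
    (4, "n", 0.382, 0.618),
    (5, "y", 0.421, 0.579),
    (6, "n", 0.146, 0.854),
    (7, "y", 1.0, 0.0),
    (8, "y", 0.999, 0.001),
    (9, "n", 0.11, 0.89),
    (10, "y", 0.377, 0.623),
]

# Sort ascending by PrN (index 3). Python's sort is stable, so the tied
# instances 2 and 8 keep their input order.
ranked = sorted(rows, key=lambda r: r[3])

ranked_ids = [r[0] for r in ranked]
ranked_labels = [r[1] for r in ranked]

print(ranked_ids)
print("".join(ranked_labels))  # actual classes in ranked order
```

Only the sequence of actual classes in ranked order matters for the steps that follow.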

From what Wikipedia says about the Mann-Whitney U statistic, you now need to sum up, for each actual class, how often each of its instances is "beaten" by the other class, i.e. how many instances of the other class are ranked ahead of it. For the positive instances (y), this would be

0, 0, 0, 0, 1, 2 => Sum: 3

and for the negative instances (n)

4, 5, 6, 6 => Sum: 21

So U_y = 3 and U_n = 21. As a check, the two must sum to the number of (y, n) pairs:

U_y + U_n = 24 = 6 * 4 = #y * #n

Following Wikipedia, the AUC values would then be

AUC_y = U_y / (#y * #n) = 3 / 24 = 0.125
AUC_n = U_n / (#y * #n) = 21 / 24 = 0.875
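
The counting above can be sketched in Python as follows. The helper name `u_statistic` is made up for this illustration, and the sketch assumes there are no ties between a positive and a negative instance (true for this sample), so no half-counts are needed.

```python
# Actual classes in ranked order (ascending PrN), read off the sorted table.
ranked_labels = list("yyyynynynn")


def u_statistic(labels, target):
    """For each instance of `target`, count the instances of the other
    class ranked ahead of it, and return the sum of those counts."""
    other_seen = 0
    total = 0
    for label in labels:
        if label == target:
            total += other_seen
        else:
            other_seen += 1
    return total


u_y = u_statistic(ranked_labels, "y")
u_n = u_statistic(ranked_labels, "n")
n_y = ranked_labels.count("y")
n_n = ranked_labels.count("n")

# Every (y, n) pair is counted exactly once, in either U_y or U_n.
assert u_y + u_n == n_y * n_n

auc_y = u_y / (n_y * n_n)
auc_n = u_n / (n_y * n_n)
print(u_y, u_n, auc_y, auc_n)  # 3 21 0.125 0.875
```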

Now, in this case I strongly believe that the value labeled AUC_n, 0.875, is the AUC you want: we sorted by PrN in ascending order, so U_n counts the (y, n) pairs in which the positive instance is ranked ahead of the negative one.

A more intuitive, graphical description of what we just did is this:

We sort our instances by their decision value / class probability. If we sort ascending by PrN, the positive ones should come first. (If we sort ascending by PrY, the negative ones should come first.) Now we draw a plot, beginning at coordinates (0,0). Every time we encounter an actual positive instance, we draw one unit up. Every time we encounter an actual negative instance, we draw one unit right. This line separates two areas, which look like this in ASCII art (I'll replace it with a decent image as soon as I can):

|..##|
|.###|
|####|
|####|
|####|
|####|

The separating line is the ROC curve, and the area under it is (hence the name) the AUC. The AUC here is 21 units, which we normalize by dividing by the total area of 6 * 4 = 24, yielding 21/24 = 0.875.
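
The up/right walk can be sketched directly in Python. The variable names are illustrative, and the label sequence is the ranked order of actual classes from the sorted table.

```python
# Actual classes in ranked order (ascending PrN).
ranked_labels = list("yyyynynynn")

height = 0  # current y-coordinate: number of positives seen so far
area = 0    # unit squares under the path

for label in ranked_labels:
    if label == "y":
        height += 1     # actual positive: one unit up
    else:
        area += height  # actual negative: one unit right, adding a column
                        # of `height` squares under the path

total_area = ranked_labels.count("y") * ranked_labels.count("n")
print(area, total_area, area / total_area)  # 21 24 0.875
```

Each step to the right adds as many squares as there are positives already seen, which is the same count that makes up U_n in the Mann-Whitney calculation.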

You can also do the whole calculation already normalized, which is equivalent to plotting FPR (x-axis) against TPR (y-axis).
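
A sketch of the normalized variant: each step is scaled by the class size, so the path runs from (0, 0) to (1, 1) and the area under it is the AUC directly. The trapezoidal sum used here is one common way to integrate the curve, not something prescribed by the answer; the names are illustrative.

```python
# Actual classes in ranked order (ascending PrN).
ranked_labels = list("yyyynynynn")
n_pos = ranked_labels.count("y")
n_neg = ranked_labels.count("n")

# Build the ROC points (FPR, TPR), starting at the origin.
points = [(0.0, 0.0)]
tp = fp = 0
for label in ranked_labels:
    if label == "y":
        tp += 1  # true positive: step up by 1 / n_pos
    else:
        fp += 1  # false positive: step right by 1 / n_neg
    points.append((fp / n_neg, tp / n_pos))

# Trapezoidal rule over consecutive points; vertical steps contribute zero.
auc = sum(
    (x1 - x0) * (y0 + y1) / 2
    for (x0, y0), (x1, y1) in zip(points, points[1:])
)
print(points[-1], auc)  # the AUC comes out at about 0.875
```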
