Weka - binary classification giving polarized/biased results


Problem Description

Let me say, first up, that I'm a WEKA newbie.

I'm using WEKA for a binary classification problem where certain metrics are being used to get a yes/no answer for the instances.

To exemplify the issue, here's the confusion matrix I got for a set with 288 instances, with 190 'yes' and 98 'no' values using BayesNet:

  a   b   <-- classified as
190   0 |   a = yes
 98   0 |   b = no

This absolute separation is the case with some other classifiers as well, but not with all of them. That said, even if classifiers don't have values polarized to such a degree, they do have a definite bias for the predominant class. For example, here's the result with RandomForest:

  a   b   <-- classified as
164  34 |   a = yes
 62  28 |   b = no

I'm pretty certain I'm missing something very obvious.

Answer

Originally, I thought that BayesNet was the problem, but now I think it is your data.

As was already pointed out in the comments, I think the problem is with the unbalanced classes. Most classifiers optimize for accuracy, which in your case is (190 + 0) / 288 ≈ 0.66 for the BayesNet and (164 + 28) / 288 ≈ 0.67 for the RandomForest.
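
If you want to sanity-check those numbers yourself, the arithmetic is just (correctly classified) / (total instances). A minimal sketch in plain Java, using the counts from the two confusion matrices above (nothing Weka-specific here):

public class AccuracyCheck {
    public static void main(String[] args) {
        int total = 288;
        // BayesNet: 190 correct 'yes' + 0 correct 'no'
        System.out.println("BayesNet accuracy:     " + (190.0 + 0) / total);   // ~0.66
        // RandomForest: 164 correct 'yes' + 28 correct 'no'
        System.out.println("RandomForest accuracy: " + (164.0 + 28) / total);  // ~0.67
    }
}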

As you can see, the difference is not that big, but the solution found by RandomForest is marginally better. It looks "better" because it doesn't put everything in the same class, but I really doubt it is statistically significant.

Like Lars Kotthoff mentioned, it is hard to say. I'd also guess that the features are just not good enough for a better separation.

In addition to trying other classifiers you should reconsider your performance measure. Accuracy is only good if you have approximately the same number of instances for each class. In other cases, MCC or AUC are good choices (but AUC won't work with LibSVM in WEKA due to incompatible implementations).
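
For example, here is a minimal sketch of how you could read those measures out of Weka's Java API via cross-validation. It assumes your data is in an ARFF file (the name mydata.arff is made up) with the class as the last attribute, and a reasonably recent Weka release (3.7.9 or later), whose Evaluation class exposes matthewsCorrelationCoefficient():

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateWithMccAndAuc {
    public static void main(String[] args) throws Exception {
        // Load the data and mark the last attribute as the class (adjust if yours differs).
        Instances data = DataSource.read("mydata.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of a BayesNet; swap in RandomForest etc. to compare.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new BayesNet(), data, 10, new Random(1));

        System.out.println(eval.toMatrixString());                 // confusion matrix
        System.out.println("Accuracy: " + eval.pctCorrect() + " %");
        // Class index 0 is 'yes' if that value is listed first in the ARFF header.
        System.out.println("MCC:      " + eval.matthewsCorrelationCoefficient(0));
        System.out.println("AUC:      " + eval.areaUnderROC(0));
    }
}

In recent Weka versions the Explorer shows the same measures in its "Detailed Accuracy By Class" table (the MCC and ROC Area columns), so you can also read them off without any code.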

The MCC for your examples would be 0 for the BayesNet and

  ((164*28) - (62*34)) / sqrt((164+62)*(34+28)*(164+34)*(62+28))
= (4592 - 2108) / sqrt(226 * 62 * 198 * 90)
= 2484 / sqrt(249693840)
= 0.15719823927071640929

for RandomForest. So RandomForest shows a slightly better result, but not that much better.
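
If you want to reproduce that calculation programmatically, here is a small sketch (plain Java, no Weka involved) that computes MCC straight from the confusion-matrix counts; by convention it returns 0 when the denominator is 0, which is exactly what happens for the BayesNet matrix:

public class MccFromCounts {
    // MCC = (TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
    static double mcc(long tp, long fn, long fp, long tn) {
        double num = (double) tp * tn - (double) fp * fn;
        double den = Math.sqrt((double) (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        return den == 0.0 ? 0.0 : num / den;  // treat the degenerate all-one-class case as 0
    }

    public static void main(String[] args) {
        // Counts taken from the matrices above, with 'yes' as the positive class.
        System.out.println(mcc(190, 0, 98, 0));   // BayesNet     -> 0.0
        System.out.println(mcc(164, 34, 62, 28)); // RandomForest -> ~0.157
    }
}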

Hard to tell without seeing your data, but they are probably not well separable.
