如何在NLTK中使用混淆矩阵模块? [英] How to use the confusion matrix module in NLTK?

查看:95
本文介绍了如何在NLTK中使用混淆矩阵模块?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我遵循NLTK手册,使用了混淆矩阵,但是混淆矩阵看起来很奇怪.

I followed the NLTK book in using the confusion matrix but the confusionmatrix looks very odd.

#empirically exam where tagger is making mistakes
test_tags = [tag for sent in brown.sents(categories='editorial')
    for (word, tag) in t2.tag(sent)]
gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
print nltk.ConfusionMatrix(gold_tags, test_tags)

任何人都可以解释如何使用混淆矩阵吗?

Can anyone explain how to use the confusion matrix?

推荐答案

首先,我假设您从旧的NLTK的第05章获得了代码: http://pastebin.com/EC8fFqLU

Firstly, I assume that you got the code from old NLTK's chapter 05: https://nltk.googlecode.com/svn/trunk/doc/book/ch05.py, particularly you're look at this section: http://pastebin.com/EC8fFqLU

现在,让我们看一下NLTK中的混淆矩阵,尝试:

Now, let's look at the confusion matrix in NLTK, try:

from nltk.metrics import ConfusionMatrix
ref  = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
cm = ConfusionMatrix(ref, tagged)
print cm

[输出]:

    | D         |
    | E I J N V |
    | T N J N B |
----+-----------+
DET |<3>. . . . |
 IN | .<1>. . . |
 JJ | . .<.>1 . |
 NN | . . .<3>1 |
 VB | . . . .<1>|
----+-----------+
(row = reference; col = test)

<>中嵌入的数字是真实的正数(tp).从上面的示例中,您可以看到引用的JJ之一被错误地标记为NN.在这种情况下,对于NN,它算作一个假阳性,对于JJ,它算作一个假阴性.

The numbers embedded in <> are the true positives (tp). And from the example above, you see that one of the JJ from reference was wrongly tagged as NN from the tagged output. For that instance, it counts as one false positive for NN and one false negative for JJ.

要访问混淆矩阵(用于计算精度/召回率/fscore),可以通过以下方式访问假阴性,假阳性和真阳性:

To access the confusion matrix (for calculating precision/recall/fscore), you can access the false negatives, false positives and true positives by:

labels = set('DET NN VB IN JJ'.split())

true_positives = Counter()
false_negatives = Counter()
false_positives = Counter()

for i in labels:
    for j in labels:
        if i == j:
            true_positives[i] += cm[i,j]
        else:
            false_negatives[i] += cm[i,j]
            false_positives[j] += cm[i,j]

print "TP:", sum(true_positives.values()), true_positives
print "FN:", sum(false_negatives.values()), false_negatives
print "FP:", sum(false_positives.values()), false_positives

[输出]:

TP: 8 Counter({'DET': 3, 'NN': 3, 'VB': 1, 'IN': 1, 'JJ': 0})
FN: 2 Counter({'NN': 1, 'JJ': 1, 'VB': 0, 'DET': 0, 'IN': 0})
FP: 2 Counter({'VB': 1, 'NN': 1, 'DET': 0, 'JJ': 0, 'IN': 0})

要计算每个标签的Fscore:

To calculate Fscore per label:

for i in sorted(labels):
    if true_positives[i] == 0:
        fscore = 0
    else:
        precision = true_positives[i] / float(true_positives[i]+false_positives[i])
        recall = true_positives[i] / float(true_positives[i]+false_negatives[i])
        fscore = 2 * (precision * recall) / float(precision + recall)
    print i, fscore

[输出]:

DET 1.0
IN 1.0
JJ 0
NN 0.75
VB 0.666666666667

我希望以上内容可以消除NLTK中的混淆矩阵用法,这是上面示例的完整代码:

I hope the above will de-confuse the confusion matrix usage in NLTK, here's the full code for the example above:

from collections import Counter
from nltk.metrics import ConfusionMatrix

ref  = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
cm = ConfusionMatrix(ref, tagged)

print cm

labels = set('DET NN VB IN JJ'.split())

true_positives = Counter()
false_negatives = Counter()
false_positives = Counter()

for i in labels:
    for j in labels:
        if i == j:
            true_positives[i] += cm[i,j]
        else:
            false_negatives[i] += cm[i,j]
            false_positives[j] += cm[i,j]

print "TP:", sum(true_positives.values()), true_positives
print "FN:", sum(false_negatives.values()), false_negatives
print "FP:", sum(false_positives.values()), false_positives
print 

for i in sorted(labels):
    if true_positives[i] == 0:
        fscore = 0
    else:
        precision = true_positives[i] / float(true_positives[i]+false_positives[i])
        recall = true_positives[i] / float(true_positives[i]+false_negatives[i])
        fscore = 2 * (precision * recall) / float(precision + recall)
    print i, fscore

这篇关于如何在NLTK中使用混淆矩阵模块?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆