How to solve a NotImplementedError from nltk.classify ClassifierI?


Problem description

I am new to programming, but have looked at my code over and over and can't see any mistakes. I don't know how to proceed any more because this error pops up no matter what I try. I'll post the full code here.

Any help would be much appreciated, thank you!

import nltk
import random
from nltk.corpus import movie_reviews
import pickle
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode 

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

        def classify(self, features):
            votes = []
            for c in self._classifiers:
                v = c.classify(features)
                votes.append(v)
            return mode(votes)


        def confidence(self, features):
            votes = []
            for c in self._classifiers:
                v = c.classify(features)
                votes.append(v)


            choice_votes = votes.count(mode(votes))
            conf = choice_votes / len(votes)
            return conf


documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []

for w in movie_reviews.words():
        all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]

training_set = featuresets[:1900]
testing_set = featuresets[1900:]

# classifier = nltk.NaiveBayesClassifier.train(training_set)
classifier_f = open("naivebayes.pickle", "rb")
classifier = pickle.load(classifier_f)
classifier_f.close()

print("Original NaiveBayes accuracy percent:",(nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(10)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

##SVC_classifier = SklearnClassifier(SVC())
##SVC_classifier.train(training_set)
##print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)


voted_classifier = VoteClassifier(classifier,
                                  NuSVC_classifier,
                                  LinearSVC_classifier,
                                  SGDClassifier_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)

print("voted_classifier accuracy percent:", (nltk.classify.accuracy(voted_classifier, testing_set))*100)

I also tried raising a NotImplementedError exception on the class at the top but it did not change the output in Python.

This is the error:

Traceback (most recent call last):
  File "code/test.py", line 109, in <module>
    print("voted_classifier accuracy percent:", (nltk.classify.accuracy(voted_classifier, testing_set))*100)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/classify/util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/classify/api.py", line 77, in classify_many
    return [self.classify(fs) for fs in featuresets]
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/classify/api.py", line 56, in classify
    raise NotImplementedError()
NotImplementedError

Recommended answer

As noted in the comments, there's some spaghetti-like code in the ClassifierI API that has classify() calling classify_many() when the latter is overridden (and classify_many() falling back to classify() otherwise). That might not be a bad thing, considering that ClassifierI is tightly coupled to the NaiveBayesClassifier object.

But for the particular use in the OP, that spaghetti code isn't welcome.
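
For reference, here is a rough paraphrase of the defaults in nltk/classify/api.py, just to make the mutual fallback easier to see; the api.py lines quoted in the traceback above are the real thing, this sketch is not the verbatim source:

from nltk.internals import overridden

class ClassifierI(object):
    def classify(self, featureset):
        # Delegates to classify_many() when a subclass overrides it;
        # otherwise there is nothing to fall back on.
        if overridden(self.classify_many):
            return self.classify_many([featureset])[0]
        raise NotImplementedError()

    def classify_many(self, featuresets):
        # Default implementation simply calls classify() on each featureset.
        return [self.classify(fs) for fs in featuresets]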

See https://www.kaggle.com/alvations/sklearn-nltk-voteclassifier

From the traceback, the error starts from nltk.classify.util.accuracy() calling ClassifierI.classify() (via the default classify_many()).

ClassifierI.classify() is generally used to classify ONE document, and its input is a featureset dictionary mapping feature names to binary values.

ClassifierI.classify_many() is supposed to classify MULTIPLE documents, and its input is a list of such featureset dictionaries.
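
For illustration, a minimal sketch of the two call shapes, reusing find_features, documents and MNB_classifier from the question code (the slice sizes are arbitrary):

# One document: a single featureset dict in, a single label out.
single_featureset = find_features(documents[0][0])
print(MNB_classifier.classify(single_featureset))       # e.g. 'pos' or 'neg'

# Many documents: a list of featureset dicts in, a list of labels out.
batch = [find_features(words) for words, _label in documents[:5]]
print(MNB_classifier.classify_many(batch))               # e.g. ['neg', 'pos', ...]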

So the quick hack is to overwrite the accuracy() function so that the VotedClassifier won't depend on the ClassifierI definition of classify() vs classify_many(). That also means we don't inherit from ClassifierI. IMHO, if you don't need anything other than classify(), there's no need to inherit the baggage that ClassifierI might come with:

def my_accuracy(classifier, gold):
    # `gold` is a list of (featureset, label) pairs, like testing_set above.
    documents, labels = zip(*gold)
    predictions = classifier.classify_documents(documents)
    correct = [y == y_hat for y, y_hat in zip(labels, predictions)]
    if correct:
        return sum(correct) / len(correct)
    else:
        return 0

class VotraClassifier:
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify_documents(self, documents):
        # Classify a list of featuresets: one majority vote per featureset.
        return [self.classify_many(doc) for doc in documents]

    def classify_many(self, features):
        # Despite the name, this classifies a single featureset by majority
        # vote across the member classifiers.
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        # Fraction of member classifiers that agree with the winning label.
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

Now if we call the new my_accuracy() with the new VotraClassifier object:

voted_classifier = VotraClassifier(nltk_nb, 
                                  NuSVC_classifier,
                                  LinearSVC_classifier,
                                  SGDClassifier_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)

my_accuracy(voted_classifier, testing_set)

[Output]:

0.86
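
The confidence() method isn't exercised above; here is a small usage sketch, reusing testing_set from the question and voted_classifier from this answer (note that classify_many() here takes a single featureset, per the naming above):

first_features, first_label = testing_set[0]
print("predicted:", voted_classifier.classify_many(first_features))
print("confidence:", voted_classifier.confidence(first_features))
print("actual:", first_label)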


Note: There's a certain randomness when it comes to shuffling the documents and then holding out a set to test the classifier's accuracy.

My suggestion is to do the following instead of a simple random.shuffle(documents) (see the sketch after the list):

  • Repeat the experiment with various random seeds.
  • For each random seed, do 10-fold cross-validation.
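
A minimal sketch of that suggestion, assuming the featuresets, the SklearnClassifier imports, VotraClassifier and my_accuracy defined above; the seed values, fold count and choice of member classifiers are only illustrative:

from sklearn.model_selection import KFold

def repeated_cv_accuracy(featuresets, seeds=(0, 1, 2), n_splits=10):
    # Average voted-classifier accuracy over several random seeds and folds.
    scores = []
    for seed in seeds:
        data = list(featuresets)
        random.Random(seed).shuffle(data)                 # reshuffle per seed
        for train_idx, test_idx in KFold(n_splits=n_splits).split(data):
            train = [data[i] for i in train_idx]
            test = [data[i] for i in test_idx]
            # Retrain the member classifiers on this fold's training split.
            voted = VotraClassifier(
                SklearnClassifier(MultinomialNB()).train(train),
                SklearnClassifier(BernoulliNB()).train(train),
                SklearnClassifier(LogisticRegression()).train(train))
            scores.append(my_accuracy(voted, test))
    return sum(scores) / len(scores)

print(repeated_cv_accuracy(featuresets))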
