NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted


Problem description

I am trying to build a sentiment analyzer using scikit-learn/pandas. Building and evaluating the model works, but attempting to classify new sample text does not.

My code:

import csv
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

infile = 'Sentiment_Analysis_Dataset.csv'
data = "SentimentText"
labels = "Sentiment"


class Classifier():
    def __init__(self):
        self.train_set, self.test_set = self.load_data()
        self.counts, self.test_counts = self.vectorize()
        self.classifier = self.train_model()

    def load_data(self):

        df = pd.read_csv(infile, header=0, error_bad_lines=False)
        train_set, test_set = train_test_split(df, test_size=.3)
        return train_set, test_set

    def train_model(self):
        classifier = BernoulliNB()
        targets = self.train_set[labels]
        classifier.fit(self.counts, targets)
        return classifier


    def vectorize(self):

        vectorizer = TfidfVectorizer(min_df=5,
                                 max_df = 0.8,
                                 sublinear_tf=True,
                                 ngram_range = (1,2),
                                 use_idf=True)
        counts = vectorizer.fit_transform(self.train_set[data])
        test_counts = vectorizer.transform(self.test_set[data])

        return counts, test_counts

    def evaluate(self):
        test_counts,test_set = self.test_counts, self.test_set
        predictions = self.classifier.predict(test_counts)
        print (classification_report(test_set[labels], predictions))
        print ("The accuracy score is {:.2%}".format(accuracy_score(test_set[labels], predictions)))


    def classify(self, input):
        input_text = input

        input_vectorizer = TfidfVectorizer(min_df=5,
                                 max_df = 0.8,
                                 sublinear_tf=True,
                                 ngram_range = (1,2),
                                 use_idf=True)
        input_counts = input_vectorizer.transform(input_text)
        predictions = self.classifier.predict(input_counts)
        print(predictions)

myModel = Classifier()

text = ['I like this I feel good about it', 'give me 5 dollars']

myModel.classify(text)
myModel.evaluate()

Error:

Traceback (most recent call last):
  File "sentiment.py", line 74, in <module>
    myModel.classify(text)
  File "sentiment.py", line 66, in classify
    input_counts = input_vectorizer.transform(input_text)
  File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1380, in transform
    X = super(TfidfVectorizer, self).transform(raw_documents)
  File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 890, in transform
    self._check_vocabulary()
  File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 278, in _check_vocabulary
    check_is_fitted(self, 'vocabulary_', msg=msg),
  File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/utils/validation.py", line 690, in check_is_fitted
    raise _NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.

I'm not sure what the issue could be. In my classify method, I create a brand new vectorizer to process the text I want to classify, separate from the vectorizer used to create the training and test data for the model.

Thanks

Recommended answer

You fitted a vectorizer, but you throw it away: it is a local variable, so it no longer exists once your vectorize function returns. Instead, save the fitted vectorizer on the instance inside vectorize, after the fit_transform call:

self._vectorizer = vectorizer
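
To see why the fitted state matters, here is a minimal sketch on made-up toy sentences (train_docs and new_docs are illustrative, not from the question's dataset). A freshly constructed TfidfVectorizer has no vocabulary, so calling transform on it raises NotFittedError, while the instance that was fitted on the training text can transform new text:

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy data, standing in for the training column of the CSV.
train_docs = ["I like this", "I hate this", "this is good", "this is bad"]
new_docs = ["I like this I feel good about it"]

# The vectorizer that has seen the training text has a vocabulary_.
fitted = TfidfVectorizer()
fitted.fit(train_docs)

# A brand new vectorizer has learned nothing yet, so transform() fails.
fresh = TfidfVectorizer()
try:
    fresh.transform(new_docs)
    raised = False
except NotFittedError:
    raised = True

# The fitted instance maps new text onto the training vocabulary.
features = fitted.transform(new_docs)
print(raised, features.shape)
```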

Then in your classify function, don't create a new vectorizer. Instead, use the one you fitted to the training data:

input_counts = self._vectorizer.transform(input_text)
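
Putting the two changes together, a trimmed-down sketch might look like the following. TinyClassifier and the toy training data are hypothetical stand-ins for the question's Classifier and CSV file, and min_df and max_df are left out only because the toy corpus is too small for them; the point is that the same fitted vectorizer is used for training and for new input.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB


class TinyClassifier:
    """Hypothetical, trimmed-down version of the question's Classifier."""

    def __init__(self, texts, targets):
        # Keep the fitted vectorizer on the instance so classify() can reuse it.
        self._vectorizer = TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2))
        counts = self._vectorizer.fit_transform(texts)
        self.classifier = BernoulliNB().fit(counts, targets)

    def classify(self, input_text):
        # transform() only: reuse the vocabulary learned from the training data.
        input_counts = self._vectorizer.transform(input_text)
        return self.classifier.predict(input_counts)


# Made-up toy data standing in for Sentiment_Analysis_Dataset.csv.
train_texts = ["I like this", "I feel good", "I hate this", "this is awful"]
train_labels = [1, 1, 0, 0]

model = TinyClassifier(train_texts, train_labels)
predictions = model.classify(["I like this I feel good about it", "give me 5 dollars"])
print(predictions)
```

Words in the new text that never appeared in the training data are simply ignored by transform, which is why the feature columns line up with what the classifier was trained on.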
