NotFittedError:TfidfVectorizer-词汇不正确 [英] NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted
问题描述
我正在尝试使用scikit-learn/pandas构建情感分析器.建立和评估模型是可行的,但是尝试对新的示例文本进行分类则无法.
I am trying to build a sentiment analyzer using scikit-learn/pandas. Building and evaluating the model works, but attempting to classify new sample text does not.
我的代码:
import csv
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
infile = 'Sentiment_Analysis_Dataset.csv'
data = "SentimentText"
labels = "Sentiment"
class Classifier():
def __init__(self):
self.train_set, self.test_set = self.load_data()
self.counts, self.test_counts = self.vectorize()
self.classifier = self.train_model()
def load_data(self):
df = pd.read_csv(infile, header=0, error_bad_lines=False)
train_set, test_set = train_test_split(df, test_size=.3)
return train_set, test_set
def train_model(self):
classifier = BernoulliNB()
targets = self.train_set[labels]
classifier.fit(self.counts, targets)
return classifier
def vectorize(self):
vectorizer = TfidfVectorizer(min_df=5,
max_df = 0.8,
sublinear_tf=True,
ngram_range = (1,2),
use_idf=True)
counts = vectorizer.fit_transform(self.train_set[data])
test_counts = vectorizer.transform(self.test_set[data])
return counts, test_counts
def evaluate(self):
test_counts,test_set = self.test_counts, self.test_set
predictions = self.classifier.predict(test_counts)
print (classification_report(test_set[labels], predictions))
print ("The accuracy score is {:.2%}".format(accuracy_score(test_set[labels], predictions)))
def classify(self, input):
input_text = input
input_vectorizer = TfidfVectorizer(min_df=5,
max_df = 0.8,
sublinear_tf=True,
ngram_range = (1,2),
use_idf=True)
input_counts = input_vectorizer.transform(input_text)
predictions = self.classifier.predict(input_counts)
print(predictions)
myModel = Classifier()
text = ['I like this I feel good about it', 'give me 5 dollars']
myModel.classify(text)
myModel.evaluate()
错误:
Traceback (most recent call last):
File "sentiment.py", line 74, in <module>
myModel.classify(text)
File "sentiment.py", line 66, in classify
input_counts = input_vectorizer.transform(input_text)
File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1380, in transform
X = super(TfidfVectorizer, self).transform(raw_documents)
File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 890, in transform
self._check_vocabulary()
File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 278, in _check_vocabulary
check_is_fitted(self, 'vocabulary_', msg=msg),
File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/utils/validation.py", line 690, in check_is_fitted
raise _NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.
我不确定这可能是什么问题.在我的分类方法中,我创建了一个全新的矢量化器来处理要分类的文本,与用于从模型中创建训练和测试数据的矢量化器分开.
I'm not sure what the issue could be. In my classify method, I create a brand new vectorizer to process the text I want to classify, separate from the vectorizer used to create training and test data from the model.
谢谢
推荐答案
您已经安装了矢量化程序,但由于在vectorize
函数的生命周期内不存在矢量化程序,因此将其丢弃.相反,将模型转换后保存在vectorize
中:
You've fitted a vectorizer, but you throw it away because it doesn't exist past the lifetime of your vectorize
function. Instead, save your model in vectorize
after it's been transformed:
self._vectorizer = vectorizer
然后在您的classify
函数中,不要创建新的矢量化器.相反,请使用适合您的训练数据的
Then in your classify
function, don't create a new vectorizer. Instead, use the one you'd fitted to the training data:
input_counts = self._vectorizer.transform(input_text)
这篇关于NotFittedError:TfidfVectorizer-词汇不正确的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!