Dimension mismatch when I try to apply tf-idf to test set
Problem description
I am trying to apply a new pre-processing algorithm to my dataset, following this answer: Encoding text in ML classifier
This is what I have tried so far:
def test_tfidf(data, ngrams = 1):
    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
    tfidf_vectorizer.fit(df_temp['Text'])
    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()
    X = tfidf_vectorizer.transform(list_corpus)
    return X, list_labels
(I suggest referring to the link mentioned above for the full code.) When I try to apply the latter two functions to my dataset:
train_x, train_y, count_vectorizer = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, ngrams = 1)
full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y), ignore_index = True)
I get this error:
---> 12 full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y, ), ignore_index = True)
---> 14 y_pred = clf.predict(X_test_naive)
ValueError: dimension mismatch
The function mentioned in the error is:
def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    clf = MultinomialNB()
    clf.fit(X_train_naive, y_train_naive)
    y_pred = clf.predict(X_test_naive)
    return
Any help in understanding what is wrong with my new definition and/or with how I apply tf-idf to my dataset would be appreciated (for the relevant parts, see: Encoding text in ML classifier).
Update: I think this question/answer might also help me figure out the issue: scikit-learn ValueError: dimension mismatch
If I replace test_x, test_y = test_tfidf(testing_set, ngrams = 1)
with test_x, test_y = test_tfidf(undersample_train, ngrams = 1)
it does not return any error. However, I do not think this is right, as I am getting very high values (99% on all statistics).
Recommended answer
When using transformers (TfidfVectorizer in this case), you must use the same object to transform both the training and the test data. The transformer is typically fitted on the training data only and then reused to transform the test data. A vectorizer fitted separately on the test set learns a different vocabulary, so its output has a different number of columns than the matrix the classifier was trained on, and that is what triggers the dimension mismatch.
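To make the mismatch concrete, here is a minimal sketch on made-up sentences (the documents and variable names are illustrative assumptions, not the asker's data). It compares the column count from a vectorizer fitted separately on the test texts with the column count from reusing the vectorizer fitted on the training texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, used only for illustration
train_docs = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "dogs and cats are pets",
]
test_docs = ["the bird sang", "my cat slept"]

# Fit on the training texts only, then reuse the same object for the test texts
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# A second vectorizer fitted on the test texts learns its own vocabulary,
# so its columns do not line up with the training matrix
separate_vectorizer = TfidfVectorizer()
X_test_separate = separate_vectorizer.fit_transform(test_docs)

print("train columns:", X_train.shape[1])
print("test columns (same vectorizer):", X_test.shape[1])
print("test columns (separate vectorizer):", X_test_separate.shape[1])
```

Words that appear only in the test texts are simply ignored by the reused vectorizer, which is the expected behaviour for unseen vocabulary.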
The correct way to do this in your case:
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf(data, ngrams = 1):
    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
    # Fit on the training data only
    tfidf_vectorizer.fit(df_temp['Text'])
    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()
    X = tfidf_vectorizer.transform(list_corpus)
    # Return the fitted vectorizer so it can be reused on the test set
    return X, list_labels, tfidf_vectorizer

def test_tfidf(data, vectorizer, ngrams = 1):
    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)
    # No need to create a new TfidfVectorizer here!
    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()
    X = vectorizer.transform(list_corpus)
    return X, list_labels
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB

# this method is copied from the other SO question
def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    clf = MultinomialNB() # Multinomial Naive Bayes
    clf.fit(X_train_naive, y_train_naive)
    y_pred = clf.predict(X_test_naive)
    # scikit-learn metrics take (y_true, y_pred), in that order
    f1 = f1_score(y_test_naive, y_pred, average = 'weighted')
    pres = precision_score(y_test_naive, y_pred, average = 'weighted')
    rec = recall_score(y_test_naive, y_pred, average = 'weighted')
    acc = accuracy_score(y_test_naive, y_pred)
    # Build the one-row result directly; DataFrame.append was removed in pandas 2.0
    res = pd.DataFrame([{'Preprocessing': preproc, 'Model': 'Naive Bayes', 'Precision': pres,
                         'Recall': rec, 'F1-score': f1, 'Accuracy': acc}])
    return res
train_x, train_y, count_vectorizer = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, count_vectorizer, ngrams = 1)
# pd.concat is used because DataFrame.append was removed in pandas 2.0
full_result = pd.concat([full_result, training_naive(train_x, test_x, train_y, test_y, count_vectorizer)], ignore_index = True)
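As an alternative, scikit-learn's Pipeline can bundle the vectorizer and the classifier so that the vectorizer is fitted only when fit is called and is reused automatically when predict is called. The following is a minimal sketch on made-up texts and labels (all data and variable names here are assumptions for illustration, and basic_preprocessing is left out):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the 'Text' and 'Label' columns
train_texts = [
    "win a free prize now",
    "free money click here",
    "meeting moved to monday",
    "see you at lunch tomorrow",
]
train_labels = [1, 1, 0, 0]
test_texts = ["free prize inside", "lunch on monday"]

# The pipeline fits TfidfVectorizer on the training texts inside fit()
# and only calls transform() on new texts inside predict()
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), MultinomialNB())
model.fit(train_texts, train_labels)
y_pred = model.predict(test_texts)
print(y_pred)
```

With this layout there is no separate vectorizer object to pass around, so the train and test matrices cannot end up with different vocabularies.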