Dimension mismatch error in CountVectorizer MultinomialNB


Question

Before I post this question, I have to say that I have thoroughly read more than 15 similar topics on this board, each with somewhat different recommendations, but none of them solved my problem.

Ok, so I split my 'spam email' text data (originally in csv format) into training and test sets, then used CountVectorizer and its 'fit_transform' function to fit the vocabulary of the corpus and extract word count features from the text. Then I applied MultinomialNB() to learn from the training set and predict on the test set. Here is my code (simplified):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# train_test_split lives in sklearn.model_selection
# (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# loading data
# data contains two columns ('text', 'target')

spam_data = pd.read_csv('spam.csv')
spam_data['target'] = np.where(spam_data['target']=='spam',1,0)

# split data
X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], spam_data['target'], random_state=0)

# fit vocabulary and extract word count features
cv = CountVectorizer()
X_traincv = cv.fit_transform(X_train)
X_testcv = cv.fit_transform(X_test)

# learn and predict using MultinomialNB
clfNB = MultinomialNB(alpha=0.1)
clfNB.fit(X_traincv, y_train)

# so far so good, but when I predict on X_testcv
y_pred = clfNB.predict(X_testcv)

# Python throws me an error: dimension mismatch

The suggestions I gleaned from previous question threads are to (1) use only .transform() on X_test, (2) check that each row in the original spam data is a string (yes, they are), or (3) do nothing to X_test. None of them worked, and Python kept giving me a 'dimension mismatch' error. After struggling for 4 hours, I had to turn to Stack Overflow. I would truly appreciate it if anyone could enlighten me on this. I just want to know what is wrong with my code and how to get the dimensions right.

Thanks.

Btw, the original data entries look like this:


                                         text   target
0 Go until jurong point, crazy.. Available only    0
1 Ok lar... Joking wif u oni...                    0
2 Free entry in 2 a wkly comp to win FA Cup fina   1
3 U dun say so early hor... U c already then say   0
4 Nah I don't think he goes to usf, he lives aro   0
5 FreeMsg Hey there darling it's been 3 week's n   1
6 WINNER!! As a valued network customer you have   1

Recommended answer


Your CountVectorizer has already been fitted with the training data. So for your test data, you just want to call transform(), not fit_transform().

Otherwise, if you call fit_transform() again on your test data, the vectorizer is refitted and you get a different set of columns, based on the vocabulary of the test data alone. The classifier was trained on the training columns, so the feature counts no longer match. Fit only once, on the training data.

X_testcv = cv.transform(X_test)
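
To see the difference in column counts, here is a minimal sketch on a toy corpus. The texts and labels are made up for illustration and are not the asker's spam.csv, and a second CountVectorizer is used for the "wrong" case so that the fitted cv is not overwritten:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the spam corpus
train_texts = [
    "free entry to win a prize",
    "ok see you later",
    "win cash now",
    "are you coming home",
]
train_labels = [1, 0, 1, 0]
test_texts = ["win a free prize now", "see you at home"]

# Fit the vocabulary on the training text only
cv = CountVectorizer()
X_traincv = cv.fit_transform(train_texts)

# Wrong: fitting on the test text builds a separate vocabulary,
# so the number of columns differs from the training matrix
X_test_wrong = CountVectorizer().fit_transform(test_texts)

# Right: transform() maps the test text onto the training vocabulary
X_test_right = cv.transform(test_texts)

print("train columns:", X_traincv.shape[1])
print("refitted test columns:", X_test_wrong.shape[1])
print("transformed test columns:", X_test_right.shape[1])

# The classifier accepts only matrices with the training column count
clfNB = MultinomialNB(alpha=0.1)
clfNB.fit(X_traincv, train_labels)
y_pred = clfNB.predict(X_test_right)
```

Words that appear only in the test text are simply ignored by transform(), which is why the transformed test matrix always has the same number of columns as the training matrix.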
