Why does the classifier.predict() method expect the number of features in the test data to be the same as in the training data?
Problem description
I am trying to build a simple SVM document classifier using scikit-learn, and I am using the following code:
import os
import numpy as np
import scipy.sparse as sp
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import cross_validation
from sklearn.datasets import load_svmlight_file
clf=svm.SVC()
path="C:\\Python27"
f1=[]
f2=[]
data2=['omg this is not a ship lol']
f=open(path+'\\mydata\\ACQ\\acqtot','r')
f=f.read()
f1=f.split(';',1085)
for i in range(0,1086):
    f2.append('acq')
f1.append('shipping ship')
f2.append('crude')
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1)
counter = CountVectorizer(min_df=1)
x_train=vectorizer.fit_transform(f1)
x_test=vectorizer.fit_transform(data2)
num_sample,num_features=x_train.shape
test_sample,test_features=x_test.shape
print("#samples: %d, #features: %d" % (num_sample, num_features)) #samples: 5, #features: 25
print("#samples: %d, #features: %d" % (test_sample, test_features))#samples: 2, #features: 37
y=['acq','crude']
#print x_test.n_features
clf.fit(x_train,f2)
#den= clf.score(x_test,y)
clf.predict(x_test)
It gives the following error:
(n_features, self.shape_fit_[1]))
ValueError: X.shape[1] = 6 should be equal to 9451, the number of features at training time
What I don't understand is why it expects the number of features to be the same. If I give the model entirely new text to classify, it is obviously not possible for every document to have the same number of features as the data used for training. Do we have to explicitly set the number of features of the test data to 9451 in this case?
Recommended answer
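To ensure that you have the same feature representation, you should not fit_transform your test data, but only transform it. Calling fit_transform on the test data learns a new vocabulary from that data alone, so the resulting columns no longer match the ones the classifier was trained on. Calling transform reuses the vocabulary learned from the training data and ignores words that were not seen during training.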
x_train=vectorizer.fit_transform(f1)
x_test=vectorizer.transform(data2)
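As a minimal sketch of the fix above, the following self-contained example uses a small in-memory corpus in place of the question's files. The documents, labels, and the choice of a linear kernel are assumptions made only for illustration; they are not taken from the original data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the question's training files.
train_docs = [
    "company agrees to acquire rival firm",
    "merger and acquisition deal announced",
    "crude oil prices rise",
    "shipping of crude oil delayed",
]
train_labels = ["acq", "acq", "crude", "crude"]

# New text to classify; "news" and "lol" never appear in the training data.
test_docs = ["crude oil shipping news lol"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary (the columns) from the training data.
x_train = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; unseen words are simply dropped.
x_test = vectorizer.transform(test_docs)

print("train: %d samples, %d features" % x_train.shape)
print("test:  %d samples, %d features" % x_test.shape)

clf = SVC(kernel="linear")  # kernel choice is an assumption for this sketch
clf.fit(x_train, train_labels)
prediction = clf.predict(x_test)
print(prediction)
```

Both matrices have the same number of columns because the column layout is fixed when the vectorizer is fitted. A new document does not need to contain every training word; words it lacks are zeros in its row, and words the vectorizer has never seen contribute nothing.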
A similar transformation into homogeneous features should be applied to your labels.
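One possible reading of this advice, sketched here as an assumption rather than as the answer's stated method, is to encode labels with a single encoder that is fitted on the training labels and then reused for the test labels. The label lists below are hypothetical. Note that scikit-learn classifiers such as SVC also accept string labels directly, so this step is optional.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration only.
train_labels = ["acq", "acq", "crude"]
test_labels = ["crude", "acq"]

encoder = LabelEncoder()
# Fit on the training labels so the class-to-integer mapping is fixed once.
y_train = encoder.fit_transform(train_labels)
# Reuse the same mapping for the test labels instead of refitting.
y_test = encoder.transform(test_labels)

print(list(encoder.classes_))
print(list(y_train), list(y_test))

# Map integer predictions back to the original label strings.
decoded = encoder.inverse_transform(y_test)
print(list(decoded))
```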