Why does classifier.predict() expect the number of features in the test data to be the same as in the training data?

Question

I am trying to build a simple SVM document classifier using scikit-learn, and I am using the following code:

import os
import numpy as np
import scipy.sparse as sp
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import cross_validation
from sklearn.datasets import load_svmlight_file

clf=svm.SVC()

path="C:\\Python27"

f1=[]
f2=[]
data2=['omg this is not a ship lol']

f=open(path+'\\mydata\\ACQ\\acqtot','r')
f=f.read()
f1=f.split(';',1085)

for i in range(0,1086):
    f2.append('acq')

f1.append('shipping ship')
f2.append('crude')

vectorizer = TfidfVectorizer(min_df=1)
counter = CountVectorizer(min_df=1)

x_train=vectorizer.fit_transform(f1)
x_test=vectorizer.fit_transform(data2)

num_sample,num_features=x_train.shape
test_sample,test_features=x_test.shape

print("#samples: %d, #features: %d" % (num_sample, num_features)) #samples: 5, #features: 25
print("#samples: %d, #features: %d" % (test_sample, test_features))#samples: 2, #features: 37

y=['acq','crude']

#print x_test.n_features

clf.fit(x_train,f2)

#den= clf.score(x_test,y)
clf.predict(x_test)

It gives the following error:

(n_features, self.shape_fit_[1]))
ValueError: X.shape[1] = 6 should be equal to 9451, the number of features at training time

But what I don't understand is why it expects the number of features to be the same. If I give the classifier completely new text data to predict on, it is obviously not possible for every document to have the same number of features as the data that was used to train it. Do we have to explicitly set the number of features of the test data to 9451 in this case?

Recommended answer

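To ensure that you have the same feature representation, you should not fit_transform your test data, but only transform it. Calling fit_transform a second time makes the vectorizer learn a new vocabulary from the test text alone, so its columns no longer correspond to the ones the classifier was trained on. Calling transform reuses the vocabulary learned from the training data: words that were not seen during training are ignored, and the result always has the same number of columns as the training matrix.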

x_train=vectorizer.fit_transform(f1)
x_test=vectorizer.transform(data2)
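
The difference between the two calls can be checked on a small example. The following is a minimal sketch, not the asker's actual setup: the four training documents and their labels are made up for illustration (the original file `acqtot` is not available), and a linear kernel is chosen only to keep the toy case simple.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up stand-in corpus; the real training documents are not shown in the question.
train_docs = [
    "company agrees to acquire rival firm",
    "merger and acquisition deal announced",
    "crude oil shipping ship delayed",
    "oil tanker ship carries crude",
]
train_labels = ["acq", "acq", "crude", "crude"]
test_docs = ["omg this is not a ship lol"]

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary from the training text
x_test = vectorizer.transform(test_docs)        # reuses that vocabulary, unseen words are dropped

# For comparison: a vectorizer fitted on the test text alone builds its own vocabulary.
refit = TfidfVectorizer().fit_transform(test_docs)

print("train:", x_train.shape)         # same number of columns as x_test
print("test (transform):", x_test.shape)
print("test (refit):", refit.shape)    # different number of columns

clf = SVC(kernel="linear")
clf.fit(x_train, train_labels)
prediction = clf.predict(x_test)       # works because the column counts match
print(prediction)
```

Passing `refit` to `clf.predict` instead would raise a feature-count error of the kind shown in the question, because its columns come from a different vocabulary.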

A similar transformation into homogeneous features should be applied to your labels.
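
The answer does not say which transformation it has in mind for the labels. One possible reading, shown here only as an assumption, is to encode them with scikit-learn's `LabelEncoder`, fitting it on the training labels and reusing the same mapping for any test labels. The label lists below are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label lists; only the class names 'acq' and 'crude' come from the question.
train_labels = ["acq", "acq", "crude"]
test_labels = ["crude", "acq"]

encoder = LabelEncoder()
y_train = encoder.fit_transform(train_labels)  # learns the class-to-integer mapping
y_test = encoder.transform(test_labels)        # applies the same mapping, no refit

# Integer predictions can be mapped back to the original class names.
decoded = encoder.inverse_transform(y_test)

print(list(encoder.classes_))
print(list(y_train), list(y_test), list(decoded))
```

This step is optional for the code in the question, since scikit-learn classifiers such as `SVC` also accept string labels directly.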
