Why does the classifier.predict() method expect the number of features in the test data to be the same as in the training data?

Views: 1053 Posted: 2020/5/4 9:48:28 Tags: python machine-learning scikit-learn svm

Problem description

I am trying to build a simple SVM document classifier using scikit-learn, with the following code:

import os
import numpy as np
import scipy.sparse as sp
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import cross_validation
from sklearn.datasets import load_svmlight_file

clf = svm.SVC()

path = "C:\\Python27"

f1 = []  # training documents
f2 = []  # training labels
data2 = ['omg this is not a ship lol']  # new document to classify

# Read the 'acq' documents, which are separated by ';'
f = open(path + '\\mydata\\ACQ\\acqtot', 'r')
f = f.read()
f1 = f.split(';', 1085)

for i in range(0, 1086):
    f2.append('acq')

# Add a single 'crude' document
f1.append('shipping ship')
f2.append('crude')

vectorizer = TfidfVectorizer(min_df=1)
counter = CountVectorizer(min_df=1)

x_train = vectorizer.fit_transform(f1)
x_test = vectorizer.fit_transform(data2)

num_sample, num_features = x_train.shape
test_sample, test_features = x_test.shape

print("#samples: %d, #features: %d" % (num_sample, num_features))
print("#samples: %d, #features: %d" % (test_sample, test_features))

y = ['acq', 'crude']

#print x_test.n_features

clf.fit(x_train, f2)

#den= clf.score(x_test,y)
clf.predict(x_test)

It gives the following error:

(n_features, self.shape_fit_[1]))
ValueError: X.shape[1] = 6 should be equal to 9451, the number of features at training time
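The 6 in this message can be reproduced in isolation. The sketch below assumes only the `data2` string from the code above and the default `TfidfVectorizer` tokenizer, which as far as I know drops single-character tokens such as "a". Calling `fit_transform` on the test document alone builds a vocabulary from that one sentence, so the resulting matrix has six columns that are unrelated to the 9451 columns learned from the training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

data2 = ['omg this is not a ship lol']

# Fitting on the test document alone learns a vocabulary from this
# single sentence, independent of any training data.
vectorizer = TfidfVectorizer(min_df=1)
x_test = vectorizer.fit_transform(data2)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)    # the tokens that became feature columns
print(x_test.shape)  # (number of documents, number of features)
```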

What I don't understand is why it expects the number of features to be the same. If I give the classifier completely new text to predict on, it is obviously not possible for every document to have the same number of features as the data used to train it. Do we have to explicitly set the number of features of the test data to 9451 in this case?
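One common way to get matching feature counts, offered here as a minimal sketch and not as the page's accepted answer, is to fit the vectorizer once on the training documents and then call `transform` (not `fit_transform`) on new text. Each column then stands for the same vocabulary word in both matrices, and words the vectorizer never saw during training are simply ignored. The tiny corpus and the names `train_docs`, `train_labels`, and `new_docs` are hypothetical stand-ins for the `acqtot` file, which is not available here.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in corpus; the original training file is not available.
train_docs = [
    'company agrees to acquire rival in stock deal',
    'merger talks end as takeover bid is withdrawn',
    'crude oil prices rise after opec meeting',
    'oil tanker shipping crude delayed at port',
]
train_labels = ['acq', 'acq', 'crude', 'crude']
new_docs = ['omg this is not a ship lol']

vectorizer = TfidfVectorizer(min_df=1)
x_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
x_test = vectorizer.transform(new_docs)         # reuses the learned vocabulary

clf = svm.SVC()
clf.fit(x_train, train_labels)
predictions = clf.predict(x_test)

print(x_train.shape[1], x_test.shape[1])  # both column counts are the same
print(predictions)
```

With this arrangement there is no need to set the feature count by hand: the fitted vectorizer fixes the number of columns, and every later document is mapped into that same space.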
