how to force scikit-learn DictVectorizer not to discard features?
Problem Description
I'm trying to use scikit-learn for a classification task. My code extracts features from the data and stores them in a dictionary, like so:
feature_dict['feature_name_1'] = feature_1
feature_dict['feature_name_2'] = feature_2
When I split the data in order to test it using sklearn.cross_validation, everything works as it should. The problem I'm having is when the test data is a new set, not part of the learning set (although it has the exact same features for each sample). After I fit the classifier on the learning set, when I try to call clf.predict I get this error:
ValueError: X has different number of features than during model fitting.
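A minimal sketch of the failure mode described above (the feature dicts, variable names, and classifier choice are illustrative, not from the original post): calling fit_transform a second time refits the DictVectorizer vocabulary on the test dicts, so the resulting matrix can have a different number of columns than the one the classifier was trained on.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train = [{"a": 1.0, "b": 2.0}, {"a": 0.0, "b": 1.0}]
test = [{"a": 1.0}]  # feature "b" happens not to occur in this sample

vec = DictVectorizer()
clf = LogisticRegression()
clf.fit(vec.fit_transform(train), [0, 1])  # vocabulary: ["a", "b"]

# Bug: refitting the vectorizer on the test dicts rebuilds the
# vocabulary from scratch, yielding a matrix of a different width.
X_test_wrong = vec.fit_transform(test)
print(X_test_wrong.shape)  # (1, 1) instead of (1, 2)
# clf.predict(X_test_wrong)  # would raise the ValueError above
```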
I am assuming this has to do with the following (from the DictVectorizer docs):
Named features not encountered during fit or fit_transform will be silently ignored.
DictVectorizer has removed some of the features, I guess... How do I disable/work around this behavior?
Thanks
=== EDIT ===
The problem was, as larsMans suggested, that I was fitting the DictVectorizer twice.
Recommended Answer
You should use fit_transform on the training set, and only transform on the test set.