How to force scikit-learn DictVectorizer not to discard features?


Problem description

I'm trying to use scikit-learn for a classification task. My code extracts features from the data and stores them in a dictionary like so:

feature_dict['feature_name_1'] = feature_1
feature_dict['feature_name_2'] = feature_2

When I split the data in order to test it using sklearn.cross_validation, everything works as it should. The problem I'm having is when the test data is a new set, not part of the learning set (although it has the same exact features for each sample). After I fit the classifier on the learning set, when I try to call clf.predict I get this error:

ValueError: X has different number of features than during model fitting.

I am assuming this has to do with the following (from the DictVectorizer docs):


Named features not encountered during fit or fit_transform will be silently ignored.

DictVectorizer has removed some of the features, I guess... How do I disable or work around this behavior?

Thanks

=== EDIT ===

The problem was, as larsMans suggested, that I was fitting the DictVectorizer twice.
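For illustration, a minimal sketch (with made-up feature dicts, not the asker's actual data) of how fitting the vectorizer twice produces mismatched feature counts:

```python
from sklearn.feature_extraction import DictVectorizer

train = [{"feature_name_1": 1.0, "feature_name_2": 0.5}]
test = [{"feature_name_1": 3.0}]  # feature_name_2 happens to be absent here

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(train)  # vocabulary: feature_name_1, feature_name_2
X_test = vec.fit_transform(test)    # BUG: refits, vocabulary shrinks to one feature

# The column counts no longer agree, which is what triggers the
# ValueError when the fitted classifier's predict is called:
print(X_train.shape[1], X_test.shape[1])  # prints: 2 1
```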

Accepted answer

You should use fit_transform on the training set, and only transform on the test set.
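A minimal sketch of that pattern (the feature dicts are invented for illustration): fit_transform learns the feature-name-to-column mapping from the training dicts, and transform reuses that same mapping on new data, so the matrices always have matching widths.

```python
from sklearn.feature_extraction import DictVectorizer

train = [{"feature_name_1": 1.0, "feature_name_2": 0.5},
         {"feature_name_1": 0.0, "feature_name_2": 2.0}]
test = [{"feature_name_1": 3.0, "feature_name_2": 1.0}]  # unseen samples

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(train)  # learn the feature-name -> column mapping
X_test = vec.transform(test)        # reuse it: columns line up with training

assert X_train.shape[1] == X_test.shape[1]  # both have 2 columns
```

Any feature name appearing only in the test dicts is dropped by transform, and any training feature missing from a test sample is filled with zero, so the classifier always sees the column layout it was fitted on.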
