CountVectorizer deleting features that only appear once
Problem description
I'm using the sklearn Python package, and I'm having trouble creating a CountVectorizer with a pre-created dictionary: the CountVectorizer seems to delete features that appear only once or not at all, and I don't want it to.
Here is my sample code:
train_count_vect, training_matrix, train_labels = setup_data(train_corpus, query, vocabulary=None)
test_count_vect, test_matrix, test_labels = setup_data(test_corpus, query, vocabulary=train_count_vect.get_feature_names())
print(len(train_count_vect.get_feature_names()))
print(len(test_count_vect.get_feature_names()))
len(train_count_vect.get_feature_names()) outputs 89967
len(test_count_vect.get_feature_names()) outputs 9833
Inside the setup_data() function, I am just initializing CountVectorizer. For the training data, I initialize it without a preset vocabulary. Then, for the test data, I initialize CountVectorizer with the vocabulary I retrieved from my training data.
How do I get the vocabularies to be the same length? I think sklearn is deleting features because they appear only once or not at all in my test corpus. I need the same vocabulary in both places, because otherwise my classifier will expect a different number of features than my test data points have.
Recommended answer
So, it's hard to say for sure without actually seeing the source code of setup_data, but I have a pretty decent guess as to what is going on here. sklearn follows a fit/transform pattern, meaning there are two stages: fit and transform (fit_transform simply runs both on the same data).
In the case of CountVectorizer, the fit stage effectively creates the vocabulary, and the transform step maps your input text into that vocabulary space.
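As a rough illustration of those two stages, here is a minimal sketch. The toy documents and the variable names (train_docs, test_docs, vocab) are made up for this example and are not from the question:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora, invented purely for illustration.
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the cat barked loudly"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)        # fit: learn the vocabulary from train_docs only
vocab = vectorizer.vocabulary_    # dict mapping each learned term to a column index

# transform: count occurrences of the *learned* terms; unseen words are ignored
test_matrix = vectorizer.transform(test_docs)
n_features = len(vocab)

print(sorted(vocab))
print(test_matrix.shape)
```

Note that "loudly" never shows up as a column, because it was not in the data the vectorizer was fitted on, while "sat" and "dog" keep their columns even though they do not occur in the test document.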
My guess is that you're calling fit on both datasets instead of just one. You need to use the same fitted CountVectorizer on both if you want the results to line up, e.g.:
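from sklearn.feature_extraction.text import CountVectorizer

model = CountVectorizer()
# Learn the vocabulary from the training corpus, then reuse it unchanged for the test corpus
transformed_train = model.fit_transform(train_corpus)
transformed_test = model.transform(test_corpus)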
Again, this can only be a guess until you post the setup_data function, but having seen this before, I would guess you're doing something more like this:
model = CountVectorizer()
transformed_train = model.fit_transform(train_corpus)
# Problem: this second fit_transform refits the model on test_corpus
transformed_test = model.fit_transform(test_corpus)
which effectively builds a new vocabulary from test_corpus and, unsurprisingly, won't give you the same vocabulary length in both cases.
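To make the difference concrete, here is a small sketch comparing the two patterns on toy data. The corpora below are invented for illustration, and the last variant (building a second vectorizer from the training vocabulary via the vocabulary parameter) is only an assumption about what setup_data was meant to do, since its source was not posted:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora, invented purely for illustration.
train_corpus = ["red apple pie", "green apple tart", "red cherry pie"]
test_corpus = ["green cherry", "blue cherry cake"]

# Pattern to avoid: fitting separately on each corpus learns two vocabularies.
refit_train = CountVectorizer().fit_transform(train_corpus)
refit_test = CountVectorizer().fit_transform(test_corpus)

# Pattern from the answer: fit once on the training data, reuse for the test data.
model = CountVectorizer()
transformed_train = model.fit_transform(train_corpus)
transformed_test = model.transform(test_corpus)

# Alternative: a second vectorizer given the training vocabulary explicitly.
# With a fixed vocabulary, every term keeps its column even when it never
# occurs in the test corpus.
fixed = CountVectorizer(vocabulary=model.vocabulary_)
fixed_test = fixed.transform(test_corpus)

print(refit_train.shape, refit_test.shape)
print(transformed_train.shape, transformed_test.shape, fixed_test.shape)
```

In this sketch the refitted matrices end up with different column counts, while both the reused model and the fixed-vocabulary vectorizer produce test matrices with the same number of columns as the training matrix. If your test vocabulary still comes out shorter after passing vocabulary=..., it is worth checking that setup_data really forwards that argument to CountVectorizer and only calls transform on the test corpus.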