CountVectorizer deleting features that only appear once

Question

I'm using the sklearn Python package, and I'm having trouble creating a CountVectorizer from a pre-created vocabulary such that the CountVectorizer doesn't delete features that appear only once or don't appear at all.

Here is my sample code:

train_count_vect, training_matrix, train_labels = setup_data(train_corpus, query, vocabulary=None)
test_count_vect, test_matrix, test_labels = setup_data(test_corpus, query, vocabulary=train_count_vect.get_feature_names())

print(len(train_count_vect.get_feature_names()))
print(len(test_count_vect.get_feature_names()))

len(train_count_vect.get_feature_names()) outputs 89967, while len(test_count_vect.get_feature_names()) outputs 9833.

Inside the setup_data() function, I am just initializing CountVectorizer. For training data, I'm initializing it without a preset vocabulary. Then, for test data, I'm initializing CountVectorizer with the vocabulary I retrieved from my training data.

How do I get the vocabularies to be the same lengths? I think sklearn is deleting features because they only appear once or don't appear at all in my test corpus. I need to have the same vocabulary because otherwise, my classifier will be of a different length from my test data points.

Recommended answer

It's impossible to say for sure without seeing the source code of setup_data, but I have a pretty decent guess as to what is going on here. sklearn transformers follow a fit/transform pattern, meaning there are two stages: fit and transform.

In the case of CountVectorizer, the fit stage effectively creates the vocabulary, and the transform stage maps your input text into that vocabulary space.
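
To make the two stages concrete, here is a minimal sketch. The two toy corpora are invented for illustration and have nothing to do with the asker's data; the point is only that fit learns the term-to-column mapping and transform counts terms against it, ignoring anything it has not seen.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora, invented for illustration only.
train_corpus = ["the cat sat on the mat", "the dog barked"]
test_corpus = ["the cat barked at the bird"]

vectorizer = CountVectorizer()

# fit: learn the vocabulary (term -> column index) from the training text.
vectorizer.fit(train_corpus)
vocab = vectorizer.vocabulary_

# transform: count occurrences of the learned terms only.
# "at" and "bird" were never seen during fit, so they are dropped.
test_matrix = vectorizer.transform(test_corpus)

print(sorted(vocab))        # the terms learned from train_corpus
print(test_matrix.shape)    # one row, one column per learned term
print(test_matrix.toarray())
```

The number of columns in test_matrix is set by the fitted vocabulary, not by the words that happen to occur in test_corpus.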

My guess is that you're calling fit on both datasets instead of just one. You need to use the same "fitted" CountVectorizer on both if you want the results to line up, e.g.:

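from sklearn.feature_extraction.text import CountVectorizer

model = CountVectorizer()
# Learn the vocabulary from the training corpus and vectorize it.
transformed_train = model.fit_transform(train_corpus)
# Reuse that same vocabulary for the test corpus (no refitting).
transformed_test = model.transform(test_corpus)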

Again, this can only be a guess until you post the setup_data function, but having seen this before I would guess you're doing something more like this:

model = CountVectorizer()
transformed_train = model.fit_transform(train_corpus)
transformed_test = model.fit_transform(test_corpus)

which will effectively make a new vocabulary for the test_corpus, which unsurprisingly won't give you the same vocabulary length in both cases.
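
As a rough check of this guess, the sketch below compares column counts on two invented toy corpora (not the asker's data). It contrasts refitting on the test corpus with reusing the fitted vectorizer, and also shows a second vectorizer pinned to the training vocabulary through the vocabulary parameter, which is what the question describes attempting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora, invented for illustration only.
train_corpus = ["the cat sat on the mat", "the dog barked"]
test_corpus = ["the cat barked at the bird"]

# Pattern 1: calling fit_transform twice rebuilds the vocabulary
# from the test corpus, so the column counts diverge.
refit = CountVectorizer()
train_a = refit.fit_transform(train_corpus)
test_a = refit.fit_transform(test_corpus)

# Pattern 2: fit once on the training corpus, then only transform the test corpus.
shared = CountVectorizer()
train_b = shared.fit_transform(train_corpus)
test_b = shared.transform(test_corpus)

# Pattern 3: a separate vectorizer pinned to the training vocabulary.
pinned = CountVectorizer(vocabulary=shared.vocabulary_)
test_c = pinned.transform(test_corpus)

print(train_a.shape[1], test_a.shape[1])  # differ: two separate vocabularies
print(train_b.shape[1], test_b.shape[1])  # equal: one shared vocabulary
print(test_c.shape[1])                    # equal to the training width
```

If setup_data really does pass the training vocabulary through to the CountVectorizer constructor, pattern 3 suggests the widths should already match, so a mismatch points to the vocabulary argument not reaching the constructor or to a refit somewhere inside the function.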
