Can I control the way the CountVectorizer vectorizes the corpus in scikit-learn?


Question

I am working with a CountVectorizer from scikit-learn, and I'm possibly attempting to do some things that the object was not made for... but I'm not sure.

When it comes to getting counts of occurrences:

from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi', 'bye', 'run away!']
corpus = ['run away!']
cv = CountVectorizer(vocabulary=vocabulary)
X = cv.fit_transform(corpus)
print(X.toarray())

Gives:

[[0 0 0]]

What I'm realizing is that the CountVectorizer will break the corpus into what I believe are unigrams:

vocabulary = ['hi', 'bye', 'run']
corpus = ['run away!']
cv = CountVectorizer(vocabulary=vocabulary)
X = cv.fit_transform(corpus)
print(X.toarray())

Gives:

[[0 0 1]]
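
One way to check this is to ask the vectorizer for its analyzer, the callable it uses to turn a document into features. The sketch below relies on `build_analyzer()` with default settings; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default settings: lowercase the text and extract word unigrams
# (tokens of two or more word characters; punctuation is dropped).
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer('run away!')
print(tokens)  # ['run', 'away']
```

Since the document only ever yields `run` and `away`, a vocabulary entry such as `'run away!'` cannot be matched under the default settings.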

Is there any way to tell the CountVectorizer exactly how you'd like to vectorize the corpus? Ideally I would like the vectorizer to treat 'run away!' as a single term, as I attempted in the first example.

In all honesty, however, I'm wondering if it is at all possible to get an outcome along these lines:

vocabulary = ['hi', 'bye', 'run away!']
corpus = ['I want to run away!']
cv = CountVectorizer(vocabulary=vocabulary)
X = cv.fit_transform(corpus)
print(X.toarray())

[[0 0 1]]

I don't see much information in the documentation for the fit_transform method, which only takes one argument as it is. If anyone has any ideas I would be grateful. Thanks!

Recommended answer

The parameter you want is called ngram_range. You pass a tuple (1, 2) to the constructor to get unigrams and bigrams. The vocabulary you pass in should then map n-grams to column indices, for example as a dict with n-grams as keys and integers as values.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(vocabulary={'hi': 0, 'bye': 1, 'run away': 2}, ngram_range=(1, 2))
print(cv.fit_transform(['I want to run away!']).toarray())
[[0 0 1]]
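
To see which features ngram_range=(1, 2) actually produces, the analyzer can be inspected directly. This is a small sketch using `build_analyzer()`; the variable names are illustrative and not part of the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Analyzer that emits unigrams followed by bigrams.
analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

features = analyzer('I want to run away!')
print(features)
# ['want', 'to', 'run', 'away', 'want to', 'to run', 'run away']
```

The single-character token "I" is dropped by the default token pattern, and the bigram `run away` (without the exclamation mark) is the feature that matches the vocabulary entry.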

Note that the default tokeniser removes the exclamation mark at the end, so the last token is away. If you want more control over how the string is broken up into tokens, follow @BrenBarn's comment.
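
If the exclamation mark should be kept as part of the term, one possible approach is to override the `token_pattern` parameter. The regular expression below is an assumption made for illustration (the default pattern with an optional trailing `!`); it is not taken from the referenced comment.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pattern: the default word pattern plus an optional
# trailing exclamation mark, so "away!" survives as a token.
pattern = r"(?u)\b\w\w+\b!?"

cv = CountVectorizer(
    vocabulary={'hi': 0, 'bye': 1, 'run away!': 2},
    ngram_range=(1, 2),
    token_pattern=pattern,
)
X = cv.fit_transform(['I want to run away!'])
result = X.toarray()
print(result)  # [[0 0 1]]
```

A custom callable passed as `tokenizer` is another option when a regular expression is not expressive enough.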
