Understanding the `ngram_range` argument in a CountVectorizer in sklearn

Question

I'm a little confused about how to use ngrams in the scikit-learn library in Python, specifically, how the ngram_range argument works in a CountVectorizer.

Running this code:

from sklearn.feature_extraction.text import CountVectorizer
vocabulary = ['hi ', 'bye', 'run away']
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
print cv.vocabulary_

gives me:

{'hi ': 0, 'bye': 1, 'run away': 2}

Where I was under the (obviously mistaken) impression that I would get unigrams and bigrams, like this:

{'hi ': 0, 'bye': 1, 'run away': 2, 'run': 3, 'away': 4}

I am working from the documentation here: http://scikit-learn.org/stable/modules/feature_extraction.html

Clearly there is something terribly wrong with my understanding of how to use ngrams. Perhaps the argument is having no effect or I have some conceptual issue with what an actual bigram is! I'm stumped. If anyone has a word of advice to throw my way, I'd be grateful.

UPDATE:
I have realized the folly of my ways. I was under the impression that the ngram_range would affect the vocabulary, not the corpus.

Recommended answer

Setting the vocabulary explicitly means no vocabulary is learned from data. If you don't set it, you get:

>>> from pprint import pprint
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer(ngram_range=(1, 2))
>>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
{u'an': 0,
 u'an apple': 1,
 u'apple': 2,
 u'apple day': 3,
 u'away': 4,
 u'day': 5,
 u'day keeps': 6,
 u'doctor': 7,
 u'doctor away': 8,
 u'keeps': 9,
 u'keeps the': 10,
 u'the': 11,
 u'the doctor': 12}
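To make the effect of `ngram_range` on a learned vocabulary easier to see, here is a small sketch that fits the same sentence with three different ranges. The helper name `learned_terms` is made up for this illustration; only `CountVectorizer`, `fit` and `vocabulary_` come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

SENTENCE = "an apple a day keeps the doctor away"


def learned_terms(ngram_range):
    """Return the sorted terms learned from SENTENCE (illustrative helper)."""
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectorizer.fit([SENTENCE])
    return sorted(vectorizer.vocabulary_)


unigrams = learned_terms((1, 1))  # single words only
bigrams = learned_terms((2, 2))   # pairs of adjacent words only
both = learned_terms((1, 2))      # unigrams and bigrams together

print(unigrams)
print(bigrams)
print(len(both) == len(unigrams) + len(bigrams))
```

With `(1, 2)` the vocabulary should simply be the union of what `(1, 1)` and `(2, 2)` learn on their own.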

An explicit vocabulary restricts the terms that will be extracted from text; the vocabulary is not changed:

>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]])  # unigram and bigram found
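Applied to the vocabulary from the question, the same idea can be sketched as follows. Two things here are assumptions made for the example: the trailing space in 'hi ' is removed (the default tokenizer does not produce tokens that end in a space, so that entry would presumably never match), and the two documents are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Vocabulary from the question, with 'hi ' changed to 'hi' (assumption, see above).
vocabulary = ["hi", "bye", "run away"]
docs = ["hi there, run away", "bye bye"]  # made-up example documents

# Same fixed vocabulary, two different n-gram ranges.
with_bigrams = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
unigrams_only = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 1))

counts_with_bigrams = with_bigrams.fit_transform(docs).toarray()
counts_unigrams_only = unigrams_only.fit_transform(docs).toarray()

# Columns follow the order of the vocabulary list: hi, bye, run away.
print(counts_with_bigrams)
print(counts_unigrams_only)
```

The 'run away' column can only be non-zero when bigrams are extracted from the documents, which is the sense in which `ngram_range` acts on the corpus and not on the vocabulary.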

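(Note that "a" is missing from the output above. No stop-word list is active by default; the default token_pattern only keeps tokens of two or more characters, so "a" is dropped at tokenization. Token selection, and stop-word filtering when a list is given, happens before n-gram extraction, hence "apple day".)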
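To illustrate why "apple day" shows up as a bigram, the order of operations can be imitated in plain Python. This is a simplified sketch, not scikit-learn's actual implementation: the function name `word_ngrams` is hypothetical, and the regular expression is the documented default `token_pattern` of CountVectorizer.

```python
import re

# Default token_pattern of CountVectorizer: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, ngram_range=(1, 1), stop_words=()):
    """Tokenize, drop stop words, then join neighbouring tokens into n-grams."""
    tokens = [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in stop_words]
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the already-filtered token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


sentence = "an apple a day keeps the doctor away"
default_bigrams = word_ngrams(sentence, ngram_range=(2, 2))
filtered_bigrams = word_ngrams(sentence, ngram_range=(2, 2), stop_words={"the"})

print(default_bigrams)   # "a" is gone before pairing, so "apple day" appears
print(filtered_bigrams)  # "the" is gone before pairing, so "keeps doctor" appears
```

Because filtering happens first, words that were not adjacent in the original sentence can end up in the same bigram.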
