How to use sklearn's CountVectorizer to get n-grams that include any punctuation as separate tokens?

Views: 12 Published: 2021/12/25 14:42:46 Tags: python nlp scikit-learn tokenize n-gram

Problem description

I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example:

import sklearn.feature_extraction.text  # FYI http://scikit-learn.org/stable/install.html

ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
vect.fit(string)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print('{1}-grams: {0}'.format(vect.get_feature_names_out().tolist(), ngram_size))

Output:

4-grams: ['like python it pretty', 'python it pretty awesome', 'really like python it']

The punctuation is removed: how can I include punctuation marks as separate tokens?

Recommended answer

When creating the sklearn.feature_extraction.text.CountVectorizer instance, use the tokenizer parameter to specify a word tokenizer that treats punctuation as separate tokens.
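For context, the default behavior can be inspected directly. The sketch below assumes the documented default token_pattern of CountVectorizer, r"(?u)\b\w\w+\b", which only matches runs of two or more word characters; that is why punctuation and one-letter words such as "I" disappear. The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "I really like python, it's pretty awesome."

# build_tokenizer() returns the callable that splits a string into tokens.
# It does not lowercase, so lowercase by hand to mimic the full pipeline.
default_tokenize = CountVectorizer().build_tokenizer()
tokens = default_tokenize(text.lower())

# Punctuation and single-character tokens ("i", "s") are not matched.
print(tokens)
```

Passing a custom callable through the tokenizer parameter replaces this step entirely.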

For example, nltk.tokenize.TreebankWordTokenizer treats most punctuation characters as separate tokens:

import sklearn.feature_extraction.text
from nltk.tokenize import TreebankWordTokenizer

ngram_size = 4
string = ["I really like python, it's pretty awesome."]
# Pass the tokenizer's bound method so each document is split by NLTK
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size),
                                                 tokenizer=TreebankWordTokenizer().tokenize)
# The vectorizer must be fitted before its feature names can be read
vect.fit(string)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print('{1}-grams: {0}'.format(vect.get_feature_names_out().tolist(), ngram_size))

Output:

4-grams: ["'s pretty awesome .", ", it 's pretty", 'i really like python',
          "it 's pretty awesome", 'like python , it', "python , it 's",
          'really like python ,']
