n-grams in python, four, five, six grams?
Question
I'm looking for a way to split a text into n-grams. Normally I would do something like:
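import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
# bigrams() expects a sequence of tokens, so split the string into words first
string_bigrams = bigrams(string.split())
# list() works whether bigrams() returns a list or a generator
print(list(string_bigrams))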
I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams or even hundred-grams?
Thanks!
Recommended answer
Other users have given great answers based on native Python. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).
There is an ngram module in nltk that people seldom use. That's not because n-grams are hard to read, but because training a model based on n-grams where n > 3 results in a lot of data sparsity.
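from nltk import ngrams
sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
# ngrams() takes a sequence of tokens and the n-gram size
sixgrams = ngrams(sentence.split(), n)
for grams in sixgrams:
    # each item is a tuple of n consecutive tokens
    print(grams)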
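For comparison, the native Python approach mentioned above can be sketched without nltk. This is only a minimal illustration, not a reproduction of any other user's answer; the helper name `ngrams_native` is made up for this example, and it assumes the text has already been split into tokens.

```python
def ngrams_native(tokens, n):
    """Return a list of n-token tuples (hypothetical helper, not part of nltk)."""
    tokens = list(tokens)
    # Build n shifted views of the token list: tokens[0:], tokens[1:], ...
    shifted = (tokens[i:] for i in range(n))
    # zip stops at the shortest view, so it yields len(tokens) - n + 1 tuples
    # (or none at all when n is larger than the number of tokens)
    return list(zip(*shifted))


sentence = 'this is a foo bar sentences and i want to ngramize it'
for gram in ngrams_native(sentence.split(), 6):
    print(gram)
```

Because the sentence has 12 tokens, this produces 7 six-grams. The same helper works for any n, so four-grams, five-grams or hundred-grams only differ in the second argument.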