n-grams in python, four, five, six grams?

Question

I'm looking for a way to split a text into n-grams. Normally I would do something like:

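from nltk import bigrams

string = "I really like python, it's pretty awesome."
# bigrams() expects a sequence of tokens, so split the text into words first
string_bigrams = bigrams(string.split())
# bigrams() returns a generator, so build a list to print the pairs
print(list(string_bigrams))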

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams or even hundred-grams?

Thanks!

Recommended answer

Other users have given great answers based on native Python. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).

There is an ngrams function in nltk (defined in nltk.util) that people seldom use. That's not because n-grams are hard to read, but because training a model based on n-grams where n > 3 results in a lot of data sparsity.
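To make the sparsity point concrete, here is a minimal sketch in plain Python (no nltk needed). The helper name `singleton_fraction` and the toy sentence are made up for illustration. It measures what share of the distinct n-grams in a text occur exactly once, which is one rough way to look at sparsity.

```python
from collections import Counter

def singleton_fraction(tokens, n):
    """Fraction of distinct n-grams that occur exactly once in tokens.

    Hypothetical helper for illustration; assumes len(tokens) >= n.
    """
    # Build n-grams by zipping n staggered views of the token list
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(grams)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

tokens = "the cat sat on the mat and the cat ate the rat".split()
for n in (1, 2, 3):
    print(n, singleton_fraction(tokens, n))
```

On this toy sentence the share of n-grams seen only once grows with n, and by n = 3 every trigram is unique. A model trained on such counts has very little repeated evidence to learn from.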

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

n = 6
# ngrams() takes a sequence of tokens and yields tuples of n consecutive tokens
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
    print(grams)
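For comparison, the native Python route mentioned at the top of this answer can be sketched with `zip`. This is one common way to do it, not necessarily what the other answers used, and the function name `ngrams_pure` is made up for this example.

```python
def ngrams_pure(tokens, n):
    """Return a list of n-grams (tuples of n consecutive tokens)."""
    # zip over n staggered slices of the list; zip stops at the shortest slice,
    # so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = 'this is a foo bar sentences and i want to ngramize it'
fourgrams = ngrams_pure(sentence.split(), 4)

for gram in fourgrams:
    print(gram)
```

Any value of n works here, so four-grams, five-grams or hundred-grams only differ in the second argument. If the text has fewer than n tokens, the result is simply an empty list.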
