What are ngram counts and how to implement them using nltk?


Question

I've read a paper that uses ngram counts as features for a classifier, and I was wondering what exactly this means.

Example text: "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam"

I can create unigrams, bigrams, trigrams, etc. out of this text, where I have to define on which "level" to create these ngrams. The "level" can be character, syllable, word, ...
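To make the "level" idea concrete, here is a minimal sketch in plain Python. The helper name `make_ngrams` is hypothetical (it is not an nltk function), and splitting on whitespace is a simplifying assumption that ignores punctuation.

```python
def make_ngrams(items, n):
    """Return every contiguous window of n items from a sequence, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

text = "Lorem ipsum dolor sit amet"

# Word level: the sequence is a list of whitespace-separated words.
words = text.lower().split()
word_bigrams = make_ngrams(words, 2)

# Character level: the sequence is the string itself.
char_trigrams = make_ngrams(text.lower(), 3)

print(word_bigrams)
print(char_trigrams[:4])
```

The same windowing logic works at either level; only the sequence passed in changes.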

So creating unigrams out of the sentence above would simply create a list of all words?

Creating bigrams would result in word pairs, each bringing together two words that follow each other?

So if the paper talks about ngram counts, it simply creates unigrams, bigrams, trigrams, etc. out of the text, and counts how often each ngram occurs?
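One way to picture "ngram counts as features" is the sketch below. It is only an illustration under assumptions: the helper `ngram_counts`, the sample sentence, and the choice of a sorted vocabulary as the feature ordering are all made up here, not taken from the paper.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count how often each n-gram (as a tuple) occurs in a token list."""
    # zip over n shifted copies of the list yields the sliding windows
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(windows)

tokens = "the cat sat on the mat and the cat slept".split()

unigram_counts = ngram_counts(tokens, 1)
bigram_counts = ngram_counts(tokens, 2)

# Turn the counts into a fixed-order feature vector for a classifier.
vocabulary = sorted(bigram_counts)
feature_vector = [bigram_counts[gram] for gram in vocabulary]

print(bigram_counts.most_common(3))
print(feature_vector)
```

In practice the vocabulary would be built from the whole training corpus, so that every document maps to a vector of the same length.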

Is there an existing method in Python's nltk package? Or do I have to implement a version of my own?

Recommended answer

I found my old code, maybe it's useful.

import nltk
from nltk import bigrams
from nltk import trigrams

text="""Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam ornare
tempor lacus, quis pellentesque diam tempus vitae. Morbi justo mauris,
congue sit amet imperdiet ipsum dolor sit amet, consectetur adipiscing elit. Nullam ornare
tempor lacus, quis pellentesque diam"""
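# split the text into tokens (word_tokenize needs the NLTK tokenizer data installed)
tokens = nltk.word_tokenize(text)
# lowercase and drop one-character tokens such as punctuation; same as unigrams
tokens = [token.lower() for token in tokens if len(token) > 1]
# bigrams() and trigrams() return generators in NLTK 3, so convert them to lists
bi_tokens = list(bigrams(tokens))
tri_tokens = list(trigrams(tokens))

# print trigram counts
print([(item, tri_tokens.count(item)) for item in sorted(set(tri_tokens))])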
>>> 
[(('adipiscing', 'elit.', 'nullam'), 2), (('amet', 'consectetur', 'adipiscing'), 2),(('amet', 'imperdiet', 'ipsum'), 1), (('congue', 'sit', 'amet'), 1), (('consectetur', 'adipiscing', 'elit.'), 2), (('diam', 'tempus', 'vitae.'), 1), (('dolor', 'sit', 'amet'), 2), (('elit.', 'nullam', 'ornare'), 2), (('imperdiet', 'ipsum', 'dolor'), 1), (('ipsum', 'dolor', 'sit'), 2), (('justo', 'mauris', 'congue'), 1), (('lacus', 'quis', 'pellentesque'), 2), (('lorem', 'ipsum', 'dolor'), 1), (('mauris', 'congue', 'sit'), 1), (('morbi', 'justo', 'mauris'), 1), (('nullam', 'ornare', 'tempor'), 2), (('ornare', 'tempor', 'lacus'), 2), (('pellentesque', 'diam', 'tempus'), 1), (('quis', 'pellentesque', 'diam'), 2), (('sit', 'amet', 'consectetur'), 2), (('sit', 'amet', 'imperdiet'), 1), (('tempor', 'lacus', 'quis'), 2), (('tempus', 'vitae.', 'morbi'), 1), (('vitae.', 'morbi', 'justo'), 1)]
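
The same idea can be generalized to any n with `nltk.util.ngrams` and `collections.Counter`, which avoids calling `list.count` once per distinct ngram. This is a sketch rather than the answer's original approach; the sample text is made up, and it splits on whitespace (an assumption) so that no tokenizer data is needed.

```python
from collections import Counter
from nltk.util import ngrams

text = "lorem ipsum dolor sit amet lorem ipsum dolor"
tokens = text.split()  # whitespace split; swap in nltk.word_tokenize if preferred

# One Counter per n-gram order: 1 = unigrams, 2 = bigrams, 3 = trigrams.
counts_by_n = {n: Counter(ngrams(tokens, n)) for n in (1, 2, 3)}

for n, counts in counts_by_n.items():
    print(n, counts.most_common(2))
```

Each `Counter` maps an ngram tuple to its frequency, so `counts_by_n[2][("lorem", "ipsum")]` looks up the count of a single bigram.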
