Computing N Grams using Python

Question

I needed to compute the Unigrams, BiGrams and Trigrams for a text file containing text like:

"Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I started in Python and used the following code:

#!/usr/bin/env python
# File: n-gram.py
def N_Gram(N, text):
    """Return the character-level N-grams of text, padded with spaces."""
    NList = []                      # start with an empty list
    if N > 1:
        space = " " * (N - 1)       # add N - 1 spaces
        text = space + text + space # add both in front and back
    # append the slices [i:i+N] to NList
    for i in range(len(text) - (N - 1)):
        NList.append(text[i:i+N])
    return NList                    # return the list

# test code
for i in range(5):
    print(N_Gram(i + 1, "text"))

# more test code
nList = N_Gram(7, "Here is a lot of text to print")
for ngram in nList:
    print('"' + ngram + '"')

http://www.daniweb.com/software-development/python/threads/39109/generate-n-grams-from-a-word

But this produces character-level n-grams within a word, whereas I want n-grams made of whole words, such as CYSTIC and FIBROSIS (unigrams) or CYSTIC FIBROSIS (a bigram). Can someone help me out as to how I can get this done?

Recommended answer

Assuming input is a string containing space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):

def ngrams(input, n):
    input = input.split(' ')
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]
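
One caveat when the input comes from a text file: split(' ') breaks on single spaces only, so double spaces leave empty strings in the word list and newlines are not treated as separators. A minimal sketch of a variant that splits on any whitespace and builds the n-grams with zip is shown here; the name word_ngrams is made up for illustration, and it returns tuples rather than lists.

```python
def word_ngrams(text, n):
    """Return the word-level n-grams of text as a list of tuples."""
    words = text.split()  # no argument: split on any run of whitespace
    # zip n staggered views of the word list: words[0:], words[1:], ...
    return list(zip(*(words[i:] for i in range(n))))

print(word_ngrams('a b c d', 2))   # [('a', 'b'), ('b', 'c'), ('c', 'd')]
print(word_ngrams('a  b\nc', 2))   # [('a', 'b'), ('b', 'c')]
print('a  b'.split(' '))           # ['a', '', 'b']
```

zip stops at the shortest slice, so asking for more words than the text contains simply gives an empty list.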

If you want those joined back into strings, you might call something like:

[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']

Lastly, that doesn't summarize things into totals, so if your input was 'a a a a', you need to count them up into a dict:

grams = {}  # maps each n-gram string to its count; input is the text string
for g in (' '.join(x) for x in ngrams(input, 2)):
    grams.setdefault(g, 0)
    grams[g] += 1
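
As an alternative to the setdefault loop, the standard library's collections.Counter can do the tallying. This is only a sketch of the same idea, not part of the original answer; ngrams is redefined here so the snippet runs on its own, and the input string 'a a a a b' is an invented example.

```python
from collections import Counter

def ngrams(text, n):
    """Same list-of-lists behaviour as the function in the answer."""
    words = text.split(' ')
    return [words[i:i + n] for i in range(len(words) - n + 1)]

# Counter tallies each joined n-gram string as it is produced.
grams = Counter(' '.join(g) for g in ngrams('a a a a b', 2))
print(grams['a a'])           # 3
print(grams['a b'])           # 1
print(grams.most_common(1))   # [('a a', 3)]
```

A Counter is a dict subclass, so code that expects the plain dict keeps working, and most_common gives the totals in sorted order.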

Putting that all together into one final function gives:

def ngrams(input, n):
    input = input.split(' ')
    output = {}
    for i in range(len(input)-n+1):
        g = ' '.join(input[i:i+n])
        output.setdefault(g, 0)
        output[g] += 1
    return output

ngrams('a a a a', 2) # {'a a': 3}
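
For prose like the text in the question, case and punctuation matter too: "Cystic" and "cystic" or "taste." and "taste" would be counted separately. The following is a rough sketch of one way to normalize before counting unigrams, bigrams and trigrams. The name count_ngrams, the token rule (runs of lowercase letters, digits and apostrophes) and the shortened sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def count_ngrams(text, n):
    """Count word-level n-grams after lowercasing and dropping punctuation."""
    # Assumed token rule: runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    grams = (' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

# Shortened, made-up sample loosely based on the text in the question.
sample = ("Cystic fibrosis affects children. Inhaling salt water "
          "helps cystic fibrosis sufferers.")

for n in (1, 2, 3):  # unigrams, bigrams, trigrams
    print(n, count_ngrams(sample, n).most_common(2))
```

Note that this simple rule lets n-grams run across sentence boundaries and would split a number such as 30,000 into two tokens; whether that is acceptable depends on what the counts are for.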
