Computing N-Grams using Python

Question

I need to compute the unigrams, bigrams and trigrams for a text file containing text like:

"Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I started in Python and used the following code:

#!/usr/bin/env python
# File: n-gram.py
def N_Gram(N, text):
    NList = []                      # start with an empty list
    if N > 1:
        space = " " * (N-1)         # add N - 1 spaces
        text = space + text + space # add both in front and back
    # append the character slices [i:i+N] to NList
    for i in range(len(text) - (N - 1)):
        NList.append(text[i:i+N])
    return NList                    # return the list

# test code
for i in range(5):
    print(N_Gram(i+1, "text"))
# more test code
nList = N_Gram(7, "Here is a lot of text to print")
for ngram in nList:
    print('"' + ngram + '"')

http://www.daniweb.com/software-development/python/threads/39109/generating-n-grams-from-a-word

But this produces character n-grams within a word, whereas I want n-grams of whole words, such as CYSTIC and FIBROSIS (unigrams) or CYSTIC FIBROSIS (a bigram). Can someone help me work out how to get this done?

Recommended answer

Assuming the input is a string containing space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):

def ngrams(input, n):
    input = input.split(' ')
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]

If you want those joined back into strings, you might call something like:

[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']
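One caveat when the text comes from a file: split(' ') splits on every single space, so repeated spaces produce empty tokens and newlines stay attached to words. The following is a minimal sketch of a variant that avoids this, not part of the original answer; word_ngrams is a hypothetical helper name, and it returns tuples rather than lists.

```python
def word_ngrams(text, n):
    """Return the word-level n-grams of text as a list of tuples.

    word_ngrams is a hypothetical name used for illustration only.
    """
    # split() with no argument splits on any run of whitespace, so
    # newlines and repeated spaces do not produce empty tokens.
    words = text.split()
    # zip over n staggered views of the word list; zip stops at the
    # shortest view, which gives len(words) - n + 1 windows.
    return list(zip(*(words[i:] for i in range(n))))


print(word_ngrams("a b  c\nd", 2))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

The tuples can be joined with ' '.join(...) in the same way as the lists returned by ngrams.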

Lastly, that doesn't summarize things into totals, so if your input was 'a a a a', you need to count them up into a dict:

text = 'a a a a'
grams = {}  # maps each bigram string to the number of times it occurs
for g in (' '.join(x) for x in ngrams(text, 2)):
    grams.setdefault(g, 0)
    grams[g] += 1
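As an alternative to the setdefault pattern, the standard library's collections.Counter does the same tallying in one call. This is a sketch rather than the answer's own method; the ngrams function is repeated here, in condensed form, only so the snippet runs on its own.

```python
from collections import Counter


def ngrams(text, n):
    # Same sliding window as the list-returning ngrams function above.
    words = text.split(' ')
    return [words[i:i + n] for i in range(len(words) - n + 1)]


# Counter consumes any iterable of hashable items and counts occurrences.
counts = Counter(' '.join(g) for g in ngrams('a a a a', 2))
print(counts)  # Counter({'a a': 3})
```

A Counter is a dict subclass, so counts['a a'] works as with the plain dict, and a missing n-gram returns 0 instead of raising KeyError.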

Putting that all together into one final function gives:

def ngrams(input, n):
    input = input.split(' ')
    output = {}  # maps each n-gram string to its count
    for i in range(len(input)-n+1):
        g = ' '.join(input[i:i+n])
        output.setdefault(g, 0)
        output[g] += 1
    return output

ngrams('a a a a', 2) # {'a a': 3}
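To tie this back to the question, here is a hedged sketch that counts unigrams, bigrams and trigrams for part of the sample text. The name ngram_counts is hypothetical, and lowercasing the words is an assumption made here so that "Cystic fibrosis" and "cystic fibrosis" are counted together; the original answer does neither lowercasing nor punctuation handling.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count word-level n-grams in text (hypothetical helper)."""
    # Assumption: case is ignored. Punctuation stays attached to words.
    words = text.lower().split()
    grams = (' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


sample = ("Cystic fibrosis affects 30,000 children and young adults "
          "in the US alone Inhaling the mists of salt water can reduce "
          "the pus and infection that fills the airways of cystic "
          "fibrosis sufferers")

for n in (1, 2, 3):
    # most_common(3) lists the three most frequent n-grams with counts.
    print(n, ngram_counts(sample, n).most_common(3))
```

For a whole file, the same function could be called on the string returned by reading the file, assuming it fits in memory.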
