Fast n-gram calculation

Problem description

I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up?
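
(For reference, the NLTK usage the question refers to is typically along these lines; the token list below is only a placeholder, not part of the question:)

    from nltk.util import ngrams

    # Placeholder token list standing in for the corpus tokens.
    tokens = "the quick brown fox jumps over the lazy dog".split()

    # Collect all bigrams and trigrams as tuples.
    bigrams = list(ngrams(tokens, 2))
    trigrams = list(ngrams(tokens, 3))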

Recommended answer

Since you didn't indicate whether you want word or character-level n-grams, I'm just going to assume the former, without loss of generality.

I also assume you start with a list of tokens, represented by strings. What you can easily do is write n-gram extraction yourself.

def ngrams(tokens, MIN_N, MAX_N):
    # Yield every n-gram (as a slice of the token list) with MIN_N <= n <= MAX_N.
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]
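
For example, called on a small token list (values here are purely illustrative):

    tokens = "the quick brown fox".split()

    # Prints every bigram and trigram as a list of tokens.
    for gram in ngrams(tokens, 2, 3):
        print(gram)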

Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
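
A minimal sketch of that idea, counting occurrences into a dict keyed by the space-joined n-gram (the helper name count_ngrams is chosen here for illustration):

    from collections import defaultdict

    def count_ngrams(tokens, MIN_N, MAX_N):
        # Same loops as above, but accumulate counts instead of yielding slices.
        counts = defaultdict(int)
        n_tokens = len(tokens)
        for i in range(n_tokens):
            for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
                counts[" ".join(tokens[i:j])] += 1
        return counts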

Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:

from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    # Typed loop indices let Cython generate plain C integer loops.
    cdef Py_ssize_t i, j, n_tokens

    count = defaultdict(int)

    # Bind the method once to avoid repeated attribute lookups.
    join_spaces = " ".join

    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[join_spaces(tokens[i:j])] += 1

    return count
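
One way to compile and use this, assuming the code above is saved as ngrams.pyx (a filename chosen here for illustration), is pyximport, which builds .pyx modules at import time; a setup.py using Cython.Build.cythonize also works for a proper build:

    import pyximport
    pyximport.install()  # compiles .pyx modules on import; requires Cython installed

    # "ngrams" is the assumed module name for the .pyx file above.
    from ngrams import ngrams

    counts = ngrams("the quick brown fox jumps".split(), 2, 3)
    print(dict(counts))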
