Fast n-gram calculation


Question

I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up?

Answer

Since you didn't indicate whether you want word or character-level n-grams, I'm just going to assume the former, without loss of generality.

I also assume you start with a list of tokens, represented by strings. What you can easily do is write n-gram extraction yourself.

def ngrams(tokens, MIN_N, MAX_N):
    # Yield every n-gram of length MIN_N..MAX_N as a slice of the token list.
    n_tokens = len(tokens)
    for i in range(n_tokens):
        # j is the exclusive end of the slice; cap it at the end of the list
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]
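
For illustration, a quick usage sketch of the generator above (the sample sentence is made up):

tokens = "the quick brown fox jumps".split()
# All bigrams and trigrams, each as a slice of the token list
for gram in ngrams(tokens, 2, 3):
    print(gram)
# ['the', 'quick'], ['the', 'quick', 'brown'], ['quick', 'brown'], ...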

Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
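
Before reaching for Cython, that replacement might look like this in plain Python (count_ngrams is an illustrative name; it reuses the same two loops but accumulates counts in a defaultdict instead of yielding):

from collections import defaultdict

def count_ngrams(tokens, min_n, max_n):
    # Same indices as the generator above, but store counts directly
    counts = defaultdict(int)
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + min_n, min(n_tokens, i + max_n) + 1):
            counts[" ".join(tokens[i:j])] += 1
    return counts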

Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:

from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    cdef Py_ssize_t i, j, n_tokens

    count = defaultdict(int)

    # Bind the method once so the inner loop avoids repeated attribute lookups
    join_spaces = " ".join

    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[join_spaces(tokens[i:j])] += 1

    return count
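
To actually use the Cython version, one option is pyximport, which compiles .pyx modules when they are imported. A minimal sketch, assuming the code above is saved as ngrams_cy.pyx (the filename is made up):

# Assumes the Cython function above lives in ngrams_cy.pyx next to this script
import pyximport
pyximport.install()

from ngrams_cy import ngrams

counts = ngrams("the quick brown fox jumps over the lazy dog".split(), 2, 3)
for gram, freq in sorted(counts.items(), key=lambda kv: -kv[1])[:5]:
    print(gram, freq)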
