Fast n-gram calculation
Question
I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up?
Answer
Since you didn't indicate whether you want word- or character-level n-grams, I'm just going to assume the former, without loss of generality.
I also assume you start with a list of tokens, represented by strings. You can easily write the n-gram extraction yourself:
def ngrams(tokens, MIN_N, MAX_N):
    """Yield every n-gram of tokens with MIN_N <= n <= MAX_N."""
    n_tokens = len(tokens)
    for i in range(n_tokens):  # i is the start index of each n-gram
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]
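For example, you can feed the generator straight into a Counter to tally unigrams and bigrams (the toy sentence below is just an illustration):

```python
from collections import Counter

def ngrams(tokens, MIN_N, MAX_N):
    """Yield every n-gram of tokens with MIN_N <= n <= MAX_N."""
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]

# Count every unigram and bigram in a toy sentence.
tokens = "the quick brown fox".split()
counts = Counter(" ".join(gram) for gram in ngrams(tokens, 1, 2))
print(counts["the quick"])  # 1
```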
Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
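In plain Python, the dict option might look like this (the ngram_counts name and the sample sentence are my own illustration, not part of the original answer):

```python
from collections import defaultdict

def ngram_counts(tokens, MIN_N, MAX_N):
    # Same loops as above, but accumulate counts directly
    # instead of yielding each n-gram.
    count = defaultdict(int)
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[" ".join(tokens[i:j])] += 1
    return count

counts = ngram_counts("to be or not to be".split(), 1, 2)
```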
Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:
from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    cdef Py_ssize_t i, j, n_tokens

    count = defaultdict(int)
    join_spaces = " ".join  # bind the method once, outside the loop

    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[join_spaces(tokens[i:j])] += 1
    return count