Fast n-gram calculation
Question
I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up?
Answer
Since you didn't indicate whether you want word- or character-level n-grams, I'm just going to assume the former, without loss of generality.
I also assume you start with a list of tokens, represented by strings. You can easily write the n-gram extraction yourself:
def ngrams(tokens, MIN_N, MAX_N):
    """Yield every n-gram of tokens with MIN_N <= n <= MAX_N."""
    n_tokens = len(tokens)
    for i in range(n_tokens):  # i is the start index of each n-gram
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]
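For example, you can feed the generator straight into a Counter to tally unigrams and bigrams (the toy sentence below is just an illustration):

```python
from collections import Counter

def ngrams(tokens, MIN_N, MAX_N):
    """Yield every n-gram of tokens with MIN_N <= n <= MAX_N."""
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]

# Count every unigram and bigram in a toy sentence.
tokens = "the quick brown fox".split()
counts = Counter(" ".join(gram) for gram in ngrams(tokens, 1, 2))
print(counts["the quick"])  # 1
```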
Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
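In plain Python, the dict option might look like this (the ngram_counts name and the sample sentence are my own illustration, not part of the original answer):

```python
from collections import defaultdict

def ngram_counts(tokens, MIN_N, MAX_N):
    # Same loops as above, but accumulate counts directly
    # instead of yielding each n-gram.
    count = defaultdict(int)
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[" ".join(tokens[i:j])] += 1
    return count

counts = ngram_counts("to be or not to be".split(), 1, 2)
```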
Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:
from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    cdef Py_ssize_t i, j, n_tokens

    count = defaultdict(int)
    join_spaces = " ".join  # bind the method once, outside the loop

    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[join_spaces(tokens[i:j])] += 1
    return count