Is there a more efficient way to find most common n-grams?
Question
I'm trying to find the k most common n-grams in a large corpus. Many places suggest the naïve approach: scan the entire corpus once and keep a dictionary holding the count of every n-gram. Is there a better way to do this?
Recommended answer
In Python, using NLTK:
$ wget http://norvig.com/big.txt
$ python
>>> from collections import Counter
>>> from nltk import ngrams
>>> bigtxt = open('big.txt').read()
>>> ngram_counts = Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
[(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]
In plain Python, without NLTK (see "Fast/Optimize N-gram implementations in python"):
>>> import collections
>>> def ngrams(text, n=2):
...     return zip(*[text[i:] for i in range(n)])
...
>>> ngram_counts = collections.Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
[(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]
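The native version above builds the full token list plus n shifted copies of it before counting. If memory is the concern, the same counts can be produced from a stream of tokens with a sliding window. The following is only a sketch: the helper names `iter_tokens`, `iter_ngrams`, and `top_ngrams` are made up for illustration, and it still keeps one dictionary entry per distinct n-gram, so it trims peak memory rather than changing the overall approach. It has not been timed against the versions above.

```python
from collections import Counter, deque
from itertools import islice


def iter_tokens(lines):
    """Yield whitespace-separated tokens one at a time from an iterable of lines."""
    for line in lines:
        yield from line.split()


def iter_ngrams(tokens, n=2):
    """Yield n-grams as tuples, holding only the last n tokens in memory."""
    it = iter(tokens)
    window = deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield tuple(window)
    for tok in it:
        window.append(tok)  # the oldest token falls off the left automatically
        yield tuple(window)


def top_ngrams(lines, n=2, k=10):
    """Count n-grams from a stream of lines and return the k most common."""
    return Counter(iter_ngrams(iter_tokens(lines), n)).most_common(k)


# Small in-memory sample; a file object works the same way, since it iterates by line.
sample = ["the cat sat on the mat", "the cat ran"]
print(top_ngrams(sample, n=2, k=1))
```

To run it on the corpus from the examples above, pass the open file instead of a list: `with open('big.txt') as f: top_ngrams(f, 2, 10)`. Because tokens are carried across line boundaries, the n-grams match those produced from `bigtxt.split()`.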
In Julia (see "Generate ngrams with Julia"):
import StatsBase: countmap
import Iterators: partition
bigtxt = readstring(open("big.txt"))
ngram_counts = countmap(collect(partition(split(bigtxt), 2, 1)))
Rough timings:
$ time python ngram-test.py # With NLTK.
real 0m3.166s
user 0m2.274s
sys 0m0.528s
$ time python ngram-native-test.py
real 0m1.521s
user 0m1.317s
sys 0m0.145s
$ time julia ngram-test.jl
real 0m3.573s
user 0m3.188s
sys 0m0.306s
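The shell `time` figures above cover the whole process, including interpreter startup, imports, and reading the file. To compare only the counting step, it can be timed in-process. This is a rough sketch, not the script used for the numbers above: `time_counting` and `zip_ngrams` are illustrative names, and the token list is a small synthetic one rather than big.txt.

```python
import time
from collections import Counter


def zip_ngrams(tokens, n=2):
    """Same zip-based n-gram helper as the native example above."""
    return zip(*[tokens[i:] for i in range(n)])


def time_counting(tokens, n=2, repeats=3):
    """Return (best wall-clock seconds over `repeats` runs, resulting Counter)."""
    best = float("inf")
    counts = None
    for _ in range(repeats):
        start = time.perf_counter()
        counts = Counter(zip_ngrams(tokens, n))
        best = min(best, time.perf_counter() - start)
    return best, counts


# Synthetic tokens for illustration; replace with open('big.txt').read().split().
tokens = ("of the people by the people for the people " * 1000).split()
elapsed, counts = time_counting(tokens)
print(f"{elapsed:.4f}s", counts.most_common(1))
```

Taking the best of a few runs reduces noise from other processes; the absolute numbers will differ by machine and Python version.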