How to get n-gram collocations and association in python nltk?
Question
In this documentation, there is the use of BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures() and TrigramCollocationFinder.

There is an example method that finds the nbest collocations based on PMI, for bigrams and trigrams. Example:

>>> finder = BigramCollocationFinder.from_words(
...     nltk.corpus.genesis.words('english-web.txt'))
>>> finder.nbest(bigram_measures.pmi, 10)
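For reference, the bigram example above runs end-to-end like this; a minimal, self-contained sketch that swaps the genesis corpus for a made-up inline word list so no corpus download is needed:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Inline word list stands in for nltk.corpus.genesis (illustrative only).
words = "I do not like green eggs and ham I do not like them".split()
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
print(finder.nbest(bigram_measures.pmi, 5))  # 5 highest-PMI bigrams
```

nbest() returns the requested number of n-gram tuples, ranked by the chosen association measure.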
I know that BigramCollocationFinder and TrigramCollocationFinder inherit from AbstractCollocationFinder, while BigramAssocMeasures() and TrigramAssocMeasures() inherit from NgramAssocMeasures.
How can I use the methods (e.g. nbest()) in AbstractCollocationFinder and NgramAssocMeasures for 4-grams, 5-grams, 6-grams, ..., n-grams, as easily as for bigrams and trigrams?
Should I create a class that inherits from AbstractCollocationFinder?
Thanks.
Answer
Edited

The current NLTK has hardcoded functions for up to QuadgramCollocationFinder, but the reasoning for why you cannot simply create an NgramCollocationFinder still stands: you would have to radically change the formulas in the from_words() function for each order of n-gram.
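If you are on a recent NLTK, the shipped quadgram classes cover the 4-gram case directly; a minimal sketch with a made-up inline word list (so no corpus download is needed):

```python
from nltk.collocations import QuadgramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures

quadgram_measures = QuadgramAssocMeasures()
# Illustrative token list; any iterable of words will do.
words = "the quick brown fox saw the quick brown fox jump".split()
finder = QuadgramCollocationFinder.from_words(words)
print(finder.nbest(quadgram_measures.pmi, 3))  # top 3 quadgrams by PMI
```

The quadgram finder exposes the same nbest()/score_ngrams() interface inherited from AbstractCollocationFinder as its bigram and trigram siblings.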
Short answer: no, you cannot simply create an AbstractCollocationFinder (ACF) and call its nbest() function if you want to find collocations beyond 2- and 3-grams.
That's because from_words() differs between n-gram orders. You will see that only the subclasses of ACF (i.e. BigramCF and TrigramCF) have a from_words() function.
>>> finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
>>> finder = AbstractCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'), 5)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: type object 'AbstractCollocationFinder' has no attribute 'from_words'
So, given this from_words() in TrigramCF:
from nltk.probability import FreqDist
from nltk.util import ingrams  # renamed to ngrams in NLTK 3.x

@classmethod
def from_words(cls, words):
    # Separate FreqDist objects; (FreqDist(),)*4 would alias one shared object.
    wfd = FreqDist()     # unigram counts
    bfd = FreqDist()     # bigram counts
    wildfd = FreqDist()  # (w1, _, w3) wildcard counts
    tfd = FreqDist()     # trigram counts
    for w1, w2, w3 in ingrams(words, 3, pad_right=True):
        wfd.inc(w1)      # fd.inc(x) is the NLTK 2.x API; fd[x] += 1 in 3.x
        if w2 is None:
            continue
        bfd.inc((w1, w2))
        if w3 is None:
            continue
        wildfd.inc((w1, w3))
        tfd.inc((w1, w2, w3))
    return cls(wfd, bfd, wildfd, tfd)
You could somehow hack it and try to hardcode a 4-gram association finder as such:
@classmethod
def from_words(cls, words):
    # Again, separate FreqDist objects rather than (FreqDist(),)*n aliases.
    wfd = FreqDist()     # unigram counts
    bfd = FreqDist()     # bigram counts
    wildfd = FreqDist()  # wildcard pair counts
    tfd = FreqDist()     # trigram counts
    ffd = FreqDist()     # fourgram counts
    for w1, w2, w3, w4 in ingrams(words, 4, pad_right=True):
        wfd.inc(w1)
        if w2 is None:
            continue
        bfd.inc((w1, w2))
        if w3 is None:
            continue
        wildfd.inc((w1, w3))
        tfd.inc((w1, w2, w3))
        if w4 is None:
            continue
        # Only the new pairs involving w4; (w1,w3) etc. were counted above.
        wildfd.inc((w1, w4))
        wildfd.inc((w2, w4))
        ffd.inc((w1, w2, w3, w4))
    return cls(wfd, bfd, wildfd, tfd, ffd)
Then you would also have to change whichever parts of the code use the cls returned from from_words() accordingly.
So you have to ask what the ultimate purpose of finding the collocations is:
- If you're looking at retrieving words within collocation windows larger than 2- or 3-grams, then you pretty much end up with a lot of noise in your word retrieval.

- If you're going to build a model based on collocation windows of 2- or 3-grams, then you will also face sparsity problems.