How to get n-gram collocations and association in python nltk?


Question

The NLTK documentation has examples that use BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder.

It includes an example that finds the n best collocations by PMI for bigrams and trigrams. For example:

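>>> import nltk
>>> from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
>>> bigram_measures = BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(
...     nltk.corpus.genesis.words('english-web.txt'))
>>> finder.nbest(bigram_measures.pmi, 10)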

I know that BigramCollocationFinder and TrigramCollocationFinder inherit from AbstractCollocationFinder, while BigramAssocMeasures() and TrigramAssocMeasures() inherit from NgramAssocMeasures.

How can I use the methods (e.g. nbest()) in AbstractCollocationFinder and NgramAssocMeasures for 4-grams, 5-grams, 6-grams, ..., n-grams, as easily as for bigrams and trigrams?

Should I create a class that inherits from AbstractCollocationFinder?

Thanks.

Recommended answer

Edited

Current NLTK releases ship hardcoded finders up to QuadgramCollocationFinder, but the reason you cannot simply create a generic NgramCollocationFinder still stands: you would have to radically change the counting logic in the from_words() function for each n-gram order.
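To make the edit above concrete, here is a minimal sketch of the 4-gram case using the finder that NLTK now provides. It assumes a recent NLTK release that includes QuadgramCollocationFinder and QuadgramAssocMeasures; the token list is made up for illustration and stands in for a real corpus.

```python
from nltk.collocations import QuadgramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures

# Toy token stream (an assumption for illustration, not real corpus data).
tokens = ("the quick brown fox jumps and "
          "the quick brown fox sleeps while "
          "the quick brown fox eats").split()

finder = QuadgramCollocationFinder.from_words(tokens)
# Drop 4-grams seen fewer than twice to cut noise.
finder.apply_freq_filter(2)

quadgram_measures = QuadgramAssocMeasures()
# Rank the surviving 4-grams by relative frequency.
best = finder.nbest(quadgram_measures.raw_freq, 3)
print(best)
```

The call pattern mirrors the bigram and trigram finders, so nbest() works the same way once a finder class exists for that order.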

Short answer: no, you cannot simply instantiate an AbstractCollocationFinder (ACF) and call nbest() if you want to find collocations beyond 2- and 3-grams.

That is because from_words() differs for each n-gram order. Only the subclasses of ACF (i.e. BigramCF and TrigramCF) have a from_words() function.

>>> finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
>>> finder = AbstractCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt',5))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'AbstractCollocationFinder' has no attribute 'from_words'

So given this from_words() in TrigramCF:

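from nltk.probability import FreqDist
from nltk.util import ngrams  # older NLTK releases called this ingrams

@classmethod
def from_words(cls, words):
    # One FreqDist per table; (FreqDist(),)*4 would alias a single object.
    wfd, wildfd, bfd, tfd = FreqDist(), FreqDist(), FreqDist(), FreqDist()

    # pad_right=True fills the trailing windows with None.
    for w1, w2, w3 in ngrams(words, 3, pad_right=True):
        wfd[w1] += 1  # older releases wrote wfd.inc(w1)

        if w2 is None:
            continue
        bfd[(w1, w2)] += 1

        if w3 is None:
            continue
        wildfd[(w1, w3)] += 1
        tfd[(w1, w2, w3)] += 1

    return cls(wfd, bfd, wildfd, tfd)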

You could hack it and try to hardcode a 4-gram association finder like this:

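from nltk.probability import FreqDist
from nltk.util import ngrams

@classmethod
def from_words(cls, words):
    # One FreqDist per table; (FreqDist(),)*n would alias a single object.
    wfd, wildfd = FreqDist(), FreqDist()
    bfd, tfd, ffd = FreqDist(), FreqDist(), FreqDist()

    # A 4-gram finder needs windows of four tokens.
    for w1, w2, w3, w4 in ngrams(words, 4, pad_right=True):
        wfd[w1] += 1

        if w2 is None:
            continue
        bfd[(w1, w2)] += 1

        if w3 is None:
            continue
        wildfd[(w1, w3)] += 1
        tfd[(w1, w2, w3)] += 1

        if w4 is None:
            continue
        # Illustrative only: these pair counts overlap with the ones above.
        # A real finder needs a separate table per gap pattern.
        wildfd[(w1, w4)] += 1
        wildfd[(w2, w4)] += 1
        wildfd[(w3, w4)] += 1
        wildfd[(w1, w3)] += 1
        wildfd[(w2, w3)] += 1
        wildfd[(w1, w2)] += 1
        ffd[(w1, w2, w3, w4)] += 1

    return cls(wfd, bfd, wildfd, tfd, ffd)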

Then you would also have to change every part of the code that uses the cls instance returned by from_words(), such as the constructor and the scoring functions.
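If PMI is the only score you need, one way around the per-order bookkeeping is to skip the finder classes altogether. The sketch below is not NLTK code: nbest_pmi and ngram_windows are hypothetical helper names, and the score is a generalized PMI, log2 of P(w1..wn) / (P(w1) * ... * P(wn)), with every probability estimated as count / total tokens. It needs only unigram counts and full n-gram counts, which is why it extends to any n, whereas measures built on full contingency tables do not.

```python
import math
from collections import Counter


def ngram_windows(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def nbest_pmi(tokens, n, top, min_freq=1):
    """Return the `top` n-grams ranked by a generalized PMI score."""
    tokens = list(tokens)
    total = len(tokens)
    word_counts = Counter(tokens)
    ngram_counts = Counter(ngram_windows(tokens, n))

    scored = []
    for gram, count in ngram_counts.items():
        if count < min_freq:
            continue
        # log2(count * total**(n-1) / product of unigram counts)
        joint = math.log2(count) + (n - 1) * math.log2(total)
        independent = sum(math.log2(word_counts[w]) for w in gram)
        scored.append((joint - independent, gram))

    # Highest score first; ties broken by the n-gram itself for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [gram for _, gram in scored[:top]]


# Toy input, made up for illustration.
tokens = ("new york city hall is near new york city hall "
          "and the city park").split()
print(nbest_pmi(tokens, 4, 1, min_freq=2))
print(nbest_pmi(tokens, 2, 1, min_freq=2))
```

The min_freq cutoff matters in practice, because PMI tends to reward n-grams made of rare words that co-occur once.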

So you have to ask: what is the ultimate purpose of finding the collocations?

  • If you are retrieving words from collocations with windows larger than 2- or 3-grams, you will end up with a lot of noise in your word retrieval.

  • If you are going to build a model based on collocations using 2- or 3-gram windows, you will also face sparsity problems.
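The sparsity point can be checked with a small sketch. It measures what share of distinct n-grams occur exactly once as n grows. The helper name singleton_ratio and the toy token list are assumptions for illustration, and a real corpus will give different numbers.

```python
from collections import Counter


def singleton_ratio(tokens, n):
    """Fraction of distinct n-grams that occur exactly once."""
    tokens = list(tokens)
    counts = Counter(zip(*(tokens[i:] for i in range(n))))
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Toy input, made up for illustration.
tokens = ("new york city hall is near new york city hall "
          "and the city park").split()
for n in range(1, 5):
    print(n, round(singleton_ratio(tokens, n), 2))
```

On this toy input the ratio climbs from about 0.56 for single words to 0.9 for 4-grams, so most higher-order n-grams have too few observations for a reliable association score.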
