How to get n-gram collocations and association in python nltk?

Views: 283 Published: 2020/5/18 0:47:14 Tags: python nlp nltk n-gram collocation

Question

The documentation has examples that use BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder.

It includes an example that finds the n best bigrams and trigrams based on PMI. Example:

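>>> import nltk
>>> from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
>>> bigram_measures = BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(
...     nltk.corpus.genesis.words('english-web.txt'))
>>> finder.nbest(bigram_measures.pmi, 10)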

I know that BigramCollocationFinder and TrigramCollocationFinder inherit from AbstractCollocationFinder, while BigramAssocMeasures() and TrigramAssocMeasures() inherit from NgramAssocMeasures.

How can I use the methods (e.g. nbest()) in AbstractCollocationFinder and NgramAssocMeasures for 4-grams, 5-grams, 6-grams, ..., n-grams, as easily as for bigrams and trigrams?

Should I create a class that inherits from AbstractCollocationFinder?

Thanks.

Recommended answer

Edited

The current NLTK hardcodes collocation finders up to QuadgramCollocationFinder, but the reason you cannot simply create an NgramCollocationFinder still stands: you would have to radically change the formulas in the from_words() function for each n-gram order.
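For 4-grams specifically, the hardcoded finder can be used the same way as the bigram and trigram ones. NLTK names the classes QuadgramCollocationFinder and QuadgramAssocMeasures. The following is a minimal sketch; the toy token list is an assumption made for illustration so that no corpus download is needed.

```python
from nltk.collocations import QuadgramCollocationFinder
from nltk.metrics import QuadgramAssocMeasures

# Toy token list (an assumption for illustration); any iterable of tokens works.
tokens = "a b c d a b c d a b c d x y".split()

quadgram_measures = QuadgramAssocMeasures()
finder = QuadgramCollocationFinder.from_words(tokens)

# Rank 4-grams by relative frequency; other measures such as
# quadgram_measures.pmi plug into nbest() the same way.
top = finder.nbest(quadgram_measures.raw_freq, 1)
print(top)
```

On a real corpus you would typically pass the corpus word list to from_words() and ask nbest() for more than one result.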

Short answer: no, you cannot simply create an AbstractCollocationFinder (ACF) and call its nbest() function if you want to find collocations beyond 2- and 3-grams.

That is because from_words() differs for each n-gram order. Only the subclasses of ACF (i.e. BigramCollocationFinder and TrigramCollocationFinder) have a from_words() function.

>>> finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
>>> finder = AbstractCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'), 5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'AbstractCollocationFinder' has no attribute 'from_words'

So, given this from_words() in TrigramCollocationFinder:

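from nltk.probability import FreqDist
from nltk.util import ngrams  # NLTK 2 offered this iterator as ingrams

@classmethod
def from_words(cls, words):
    # Four independent counters; (FreqDist(),)*4 would alias one shared object.
    wfd, wildfd, bfd, tfd = (FreqDist() for _ in range(4))

    for w1, w2, w3 in ngrams(words, 3, pad_right=True):
        wfd[w1] += 1  # unigram count (FreqDist.inc() was removed in NLTK 3)

        if w2 is None:
            continue
        bfd[(w1, w2)] += 1  # contiguous bigram

        if w3 is None:
            continue
        wildfd[(w1, w3)] += 1  # bigram with a wildcard in the middle
        tfd[(w1, w2, w3)] += 1  # trigram

    return cls(wfd, bfd, wildfd, tfd)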
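To see what these four frequency distributions hold in practice, here is a small sketch that runs NLTK's own TrigramCollocationFinder on a toy token list (the tokens are an assumption for illustration) and inspects the counters through the finder's attributes.

```python
from nltk.collocations import TrigramCollocationFinder

# Toy token list (an assumption for illustration).
tokens = "a b c a b d".split()
finder = TrigramCollocationFinder.from_words(tokens)

print(dict(finder.word_fd))      # unigram counts
print(dict(finder.bigram_fd))    # contiguous bigrams (w1, w2)
print(dict(finder.wildcard_fd))  # pairs (w1, w3) with any word in between
print(dict(finder.ngram_fd))     # trigrams (w1, w2, w3)
```

Every association measure for trigrams is computed from these counts, which is why each n-gram order needs its own bookkeeping.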

You could somehow hack it and try to hardcode a 4-gram association finder like this:

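from nltk.probability import FreqDist
from nltk.util import ngrams

@classmethod
def from_words(cls, words):
    # Rough hack for illustration only; not NLTK's own implementation.
    # Five independent counters (the original mixed up the names fofd and ffd).
    wfd, wildfd, bfd, tfd, ffd = (FreqDist() for _ in range(5))

    # Only w1..w4 are used, so iterating over 4-grams is enough.
    for w1, w2, w3, w4 in ngrams(words, 4, pad_right=True):
        wfd[w1] += 1

        if w2 is None:
            continue
        bfd[(w1, w2)] += 1

        if w3 is None:
            continue
        wildfd[(w1, w3)] += 1
        tfd[(w1, w2, w3)] += 1

        if w4 is None:
            continue
        # Word pairs inside the 4-gram window, kept as in the original hack.
        wildfd[(w1, w4)] += 1
        wildfd[(w2, w4)] += 1
        wildfd[(w3, w4)] += 1
        wildfd[(w1, w3)] += 1
        wildfd[(w2, w3)] += 1
        wildfd[(w1, w2)] += 1
        ffd[(w1, w2, w3, w4)] += 1

    # The constructor must be extended to accept the fifth distribution.
    return cls(wfd, bfd, wildfd, tfd, ffd)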

Then you would also have to change every part of the code that uses the cls instance returned from from_words(), for example the constructor and the scoring functions.
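If all you need is a ranking of contiguous n-grams for arbitrary n, one alternative to subclassing is a small standalone scorer. The sketch below is not NLTK API: ngram_pmi_nbest and its min_freq parameter are hypothetical names, and it uses a common generalization of PMI to n words, log2(c(w1..wn) * N^(n-1) / (c(w1) * ... * c(wn))), ignoring the wildcard marginals that NLTK's finders track.

```python
import math
from collections import Counter

def ngram_pmi_nbest(words, n, k, min_freq=1):
    """Return the k highest-PMI contiguous n-grams (hypothetical helper, not NLTK API)."""
    words = list(words)
    unigram_counts = Counter(words)
    ngram_counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    total = len(words)

    def pmi(gram):
        # log2( c(gram) * N^(n-1) ) - log2( product of the unigram counts )
        numerator = ngram_counts[gram] * total ** (n - 1)
        denominator = math.prod(unigram_counts[w] for w in gram)
        return math.log2(numerator) - math.log2(denominator)

    # Drop n-grams rarer than min_freq, then sort by descending score
    # with the n-gram itself as a deterministic tie-breaker.
    scored = [(pmi(g), g) for g, c in ngram_counts.items() if c >= min_freq]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [gram for _, gram in scored[:k]]

tokens = "a b c d a b c d e f".split()
print(ngram_pmi_nbest(tokens, 4, 1))
print(ngram_pmi_nbest(tokens, 4, 1, min_freq=2))
```

Without a frequency threshold, PMI favors n-grams built from rare words, so in practice a min_freq filter is usually worth applying.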

So you have to ask: what is the ultimate purpose of finding the collocations?

If you're retrieving words within collocation windows larger than 2- or 3-grams, you pretty much end up with a lot of noise in your word retrieval.

If you're going to build a model based on collocations using 2- or 3-gram windows, you will also face sparsity problems.
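The sparsity problem can be illustrated with a few lines of plain Python. This sketch measures the share of distinct n-grams that occur exactly once; singleton_share is a hypothetical helper and the sentence is a toy example, so the numbers only indicate the trend.

```python
from collections import Counter

def singleton_share(words, n):
    """Fraction of distinct contiguous n-grams that occur exactly once."""
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return sum(1 for count in grams.values() if count == 1) / len(grams)

# Toy sentence (an assumption for illustration).
tokens = "the cat sat on the mat and the cat sat on the rug".split()
for n in range(1, 5):
    print(n, round(singleton_share(tokens, n), 3))
```

In this toy text the share of one-off n-grams rises with n, so counts for longer n-grams carry less statistical evidence per item.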
