NLTK - Counting Frequency of Bigram


Problem Description

This is a Python and NLTK newbie question.

I want to find the frequency of bigrams which occur together more than 10 times and have the highest PMI.

For this, I am working with this code:

import nltk
from nltk.collocations import BigramCollocationFinder

def get_list_phrases(text):

    tweet_phrases = []

    # Flatten the tweets into a single list of tokens.
    for tweet in text:
        tweet_words = tweet.split()
        tweet_phrases.extend(tweet_words)

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tweet_phrases, window_size=13)
    finder.apply_freq_filter(10)
    finder.nbest(bigram_measures.pmi, 20)

    # Prints the raw bigram counts rather than the filtered, PMI-ranked list.
    for k, v in finder.ngram_fd.items():
        print(k, v)

However, this does not restrict the results to the top 20; I see results whose frequency is less than 10. I am new to the world of Python.

Can someone please point out how to modify this to get only the top 20?

Thanks

Solution

The problem is with the way you are trying to use apply_freq_filter. We are discussing word collocations, and as you know, a word collocation is about dependency between words. The BigramCollocationFinder class inherits from a class named AbstractCollocationFinder, and the function apply_freq_filter belongs to that class. apply_freq_filter is not supposed to delete some word collocations outright; rather, it provides a filtered list of collocations when other functions try to access the list.

Now why is that? Imagine that if filtering collocations simply deleted them, then many probability measures, such as the likelihood ratio or PMI itself (which compute the probability of a word relative to other words in a corpus), would not function properly after words were removed from random positions in the given corpus. Deleting some collocations from the given list of words would disable many potential functionalities and computations. Moreover, computing all of these measures before the deletion would bring a massive computation overhead which the user might not need after all.

Now, the question is how to use the apply_freq_filter function correctly. There are a few ways. In the following I will show the problem and its solution.

Let's define a sample corpus and split it into a list of words, similar to what you have done:

import nltk
from nltk.collocations import *

tweet_phrases = "I love iphone . I am so in love with iphone . iphone is great . samsung is great . iphone sucks. I really really love iphone cases. samsung can never beat iphone . samsung is better than apple"

For the purpose of experimenting, I set the window size to 3:

finder = BigramCollocationFinder.from_words(tweet_phrases.split(), window_size = 3)
finder1 = BigramCollocationFinder.from_words(tweet_phrases.split(), window_size = 3)

Notice that, for the sake of comparison, I only apply the filter on finder1:

finder1.apply_freq_filter(2)
bigram_measures = nltk.collocations.BigramAssocMeasures()

Now if I write:

for k,v in finder.ngram_fd.items():
  print(k,v)

the output is:

(('.', 'is'), 3)
(('iphone', '.'), 3)
(('love', 'iphone'), 3)
(('.', 'iphone'), 2)
(('.', 'samsung'), 2)
(('great', '.'), 2)
(('iphone', 'I'), 2)
(('iphone', 'samsung'), 2)
(('is', '.'), 2)
(('is', 'great'), 2)
(('samsung', 'is'), 2)
(('.', 'I'), 1)
(('.', 'am'), 1)
(('.', 'sucks.'), 1)
(('I', 'am'), 1)
(('I', 'iphone'), 1)
(('I', 'love'), 1)
(('I', 'really'), 1)
(('I', 'so'), 1)
(('am', 'in'), 1)
(('am', 'so'), 1)
(('beat', '.'), 1)
(('beat', 'iphone'), 1)
(('better', 'apple'), 1)
(('better', 'than'), 1)
(('can', 'beat'), 1)
(('can', 'never'), 1)
(('cases.', 'can'), 1)
(('cases.', 'samsung'), 1)
(('great', 'iphone'), 1)
(('great', 'samsung'), 1)
(('in', 'love'), 1)
(('in', 'with'), 1)
(('iphone', 'cases.'), 1)
(('iphone', 'great'), 1)
(('iphone', 'is'), 1)
(('iphone', 'sucks.'), 1)
(('is', 'better'), 1)
(('is', 'than'), 1)
(('love', '.'), 1)
(('love', 'cases.'), 1)
(('love', 'with'), 1)
(('never', 'beat'), 1)
(('never', 'iphone'), 1)
(('really', 'iphone'), 1)
(('really', 'love'), 1)
(('samsung', 'better'), 1)
(('samsung', 'can'), 1)
(('samsung', 'great'), 1)
(('samsung', 'never'), 1)
(('so', 'in'), 1)
(('so', 'love'), 1)
(('sucks.', 'I'), 1)
(('sucks.', 'really'), 1)
(('than', 'apple'), 1)
(('with', '.'), 1)
(('with', 'iphone'), 1)


I would get the same result if I wrote the same for finder1, so at first glance the filter doesn't seem to work. However, see how it has worked: the trick is to use score_ngrams.

If I use score_ngrams on finder:

finder.score_ngrams(bigram_measures.pmi)

the output is:

[(('am', 'in'), 5.285402218862249), (('am', 'so'), 5.285402218862249), (('better', 'apple'), 5.285402218862249), (('better', 'than'), 5.285402218862249), (('can', 'beat'), 5.285402218862249), (('can', 'never'), 5.285402218862249), (('cases.', 'can'), 5.285402218862249), (('in', 'with'), 5.285402218862249), (('never', 'beat'), 5.285402218862249), (('so', 'in'), 5.285402218862249), (('than', 'apple'), 5.285402218862249), (('sucks.', 'really'), 4.285402218862249), (('is', 'great'), 3.7004397181410926), (('I', 'am'), 3.7004397181410926), (('I', 'so'), 3.7004397181410926), (('cases.', 'samsung'), 3.7004397181410926), (('in', 'love'), 3.7004397181410926), (('is', 'better'), 3.7004397181410926), (('is', 'than'), 3.7004397181410926), (('love', 'cases.'), 3.7004397181410926), (('love', 'with'), 3.7004397181410926), (('samsung', 'better'), 3.7004397181410926), (('samsung', 'can'), 3.7004397181410926), (('samsung', 'never'), 3.7004397181410926), (('so', 'love'), 3.7004397181410926), (('sucks.', 'I'), 3.7004397181410926), (('samsung', 'is'), 3.115477217419936), (('.', 'am'), 2.9634741239748865), (('.', 'sucks.'), 2.9634741239748865), (('beat', '.'), 2.9634741239748865), (('with', '.'), 2.9634741239748865), (('.', 'is'), 2.963474123974886), (('great', '.'), 2.963474123974886), (('love', 'iphone'), 2.7004397181410926), (('I', 'really'), 2.7004397181410926), (('beat', 'iphone'), 2.7004397181410926), (('great', 'samsung'), 2.7004397181410926), (('iphone', 'cases.'), 2.7004397181410926), (('iphone', 'sucks.'), 2.7004397181410926), (('never', 'iphone'), 2.7004397181410926), (('really', 'love'), 2.7004397181410926), (('samsung', 'great'), 2.7004397181410926), (('with', 'iphone'), 2.7004397181410926), (('.', 'samsung'), 2.37851162325373), (('is', '.'), 2.37851162325373), (('iphone', 'I'), 2.1154772174199366), (('iphone', 'samsung'), 2.1154772174199366), (('I', 'love'), 2.115477217419936), (('iphone', '.'), 1.963474123974886), (('great', 'iphone'), 1.7004397181410922), (('iphone', 'great'), 1.7004397181410922), (('really', 'iphone'), 1.7004397181410922), (('.', 'iphone'), 1.37851162325373), (('.', 'I'), 1.37851162325373), (('love', '.'), 1.37851162325373), (('I', 'iphone'), 1.1154772174199366), (('iphone', 'is'), 1.1154772174199366)]

Now notice what happens when I compute the same for finder1, which was filtered to a frequency of 2:

finder1.score_ngrams(bigram_measures.pmi)

And the output:

[(('is', 'great'), 3.7004397181410926), (('samsung', 'is'), 3.115477217419936), (('.', 'is'), 2.963474123974886), (('great', '.'), 2.963474123974886), (('love', 'iphone'), 2.7004397181410926), (('.', 'samsung'), 2.37851162325373), (('is', '.'), 2.37851162325373), (('iphone', 'I'), 2.1154772174199366), (('iphone', 'samsung'), 2.1154772174199366), (('iphone', '.'), 1.963474123974886), (('.', 'iphone'), 1.37851162325373)]

Notice that none of the collocations with a frequency of less than 2 appear in this list; this is exactly the result you were looking for. So the filter has worked. Also, the documentation gives only a minimal hint about this issue.
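
Putting this together, here is a minimal sketch (my own adaptation, not code from the original answer; the function name and parameters are illustrative) of how the question's get_list_phrases could be rewritten to report only the top 20 bigrams, ranked by PMI, among those that occur at least 10 times: rank with nbest (or score_ngrams) instead of printing the raw ngram_fd counts.

import nltk
from nltk.collocations import BigramCollocationFinder

def top_bigrams_by_pmi(tweets, min_freq=10, top_n=20, window_size=13):
    # Flatten the tweets into a single list of tokens.
    words = []
    for tweet in tweets:
        words.extend(tweet.split())

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words, window_size=window_size)

    # Exclude bigrams that occur fewer than min_freq times.
    finder.apply_freq_filter(min_freq)

    # nbest honors the frequency filter and returns the top_n bigrams ranked by PMI.
    return finder.nbest(bigram_measures.pmi, top_n)

If the PMI scores themselves are needed, finder.score_ngrams(bigram_measures.pmi)[:top_n] returns (bigram, score) pairs in the same order.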

I hope this has answered your question; otherwise, please let me know.

Disclaimer: If you are primarily dealing with tweets, a window size of 13 is far too big. As you may have noticed, the tweets in my sample corpus are so short that applying a window size of 13 would turn up irrelevant collocations.
