NLTK - Counting Frequency of Bigram


Problem Description

This is a Python and NLTK newbie question.

I want to find frequency of bigrams which occur more than 10 times together and have the highest PMI.

For this, I am working with this code:

import nltk
from nltk.collocations import BigramCollocationFinder

def get_list_phrases(text):
    tweet_phrases = []
    for tweet in text:
        tweet_words = tweet.split()
        tweet_phrases.extend(tweet_words)

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tweet_phrases, window_size=13)
    finder.apply_freq_filter(10)
    finder.nbest(bigram_measures.pmi, 20)

    for k, v in finder.ngram_fd.items():
        print(k, v)

However, this does not restrict the results to the top 20. I see results which have a frequency < 10. I am new to the world of Python.

Can someone please point out how to modify this to get only the top 20?

Thanks

Answer

The problem is with the way you are trying to use apply_freq_filter. We are discussing word collocations. As you know, a word collocation is about dependency between words. The BigramCollocationFinder class inherits from a class named AbstractCollocationFinder, and the function apply_freq_filter belongs to that class. apply_freq_filter is not supposed to totally delete some word collocations, but rather to provide a filtered list of collocations when some other function tries to access the list.

Now why is that? Imagine that if filtering collocations simply deleted them, then many probability measures, such as the likelihood ratio or PMI itself (which compute the probability of a word relative to other words in a corpus), would not function properly after deleting words from random positions in the given corpus. Deleting some collocations from the given list of words would disable many potential functionalities and computations. Also, computing all of these measures before the deletion would bring a massive computation overhead which the user might not need after all.
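
For reference, here is a rough sketch of the textbook PMI definition these association measures are built on. This is my own illustration, not part of the original answer; NLTK's BigramAssocMeasures.pmi works from the finder's co-occurrence counts, so its exact values also depend on the chosen window size:

import math

def pmi(count_xy, count_x, count_y, total):
    # Pointwise mutual information: log2( P(x, y) / (P(x) * P(y)) ),
    # estimated from raw co-occurrence and unigram counts.
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))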

Now, the question is: how do we correctly use the apply_freq_filter function? There are a few ways. In the following I will show the problem and its solution.

Let's define a sample corpus and split it into a list of words, similar to what you have done:

import nltk
from nltk.collocations import *

tweet_phrases = "I love iphone . I am so in love with iphone . iphone is great . samsung is great . iphone sucks. I really really love iphone cases. samsung can never beat iphone . samsung is better than apple"

For the purpose of experimenting I set the window size to 3:

finder = BigramCollocationFinder.from_words(tweet_phrases.split(), window_size=3)
finder1 = BigramCollocationFinder.from_words(tweet_phrases.split(), window_size=3)

Notice that for the sake of comparison I only use the filter on finder1:

finder1.apply_freq_filter(2)
bigram_measures = nltk.collocations.BigramAssocMeasures()

Now, if I write:

for k, v in finder.ngram_fd.items():
    print(k, v)

the output is:

(('.', 'is'), 3)
(('iphone', '.'), 3)
(('love', 'iphone'), 3)
(('.', 'iphone'), 2)
(('.', 'samsung'), 2)
(('great', '.'), 2)
(('iphone', 'I'), 2)
(('iphone', 'samsung'), 2)
(('is', '.'), 2)
(('is', 'great'), 2)
(('samsung', 'is'), 2)
(('.', 'I'), 1)
(('.', 'am'), 1)
(('.', 'sucks.'), 1)
(('I', 'am'), 1)
(('I', 'iphone'), 1)
(('I', 'love'), 1)
(('I', 'really'), 1)
(('I', 'so'), 1)
(('am', 'in'), 1)
(('am', 'so'), 1)
(('beat', '.'), 1)
(('beat', 'iphone'), 1)
(('better', 'apple'), 1)
(('better', 'than'), 1)
(('can', 'beat'), 1)
(('can', 'never'), 1)
(('cases.', 'can'), 1)
(('cases.', 'samsung'), 1)
(('great', 'iphone'), 1)
(('great', 'samsung'), 1)
(('in', 'love'), 1)
(('in', 'with'), 1)
(('iphone', 'cases.'), 1)
(('iphone', 'great'), 1)
(('iphone', 'is'), 1)
(('iphone', 'sucks.'), 1)
(('is', 'better'), 1)
(('is', 'than'), 1)
(('love', '.'), 1)
(('love', 'cases.'), 1)
(('love', 'with'), 1)
(('never', 'beat'), 1)
(('never', 'iphone'), 1)
(('really', 'iphone'), 1)
(('really', 'love'), 1)
(('samsung', 'better'), 1)
(('samsung', 'can'), 1)
(('samsung', 'great'), 1)
(('samsung', 'never'), 1)
(('so', 'in'), 1)
(('so', 'love'), 1)
(('sucks.', 'I'), 1)
(('sucks.', 'really'), 1)
(('than', 'apple'), 1)
(('with', '.'), 1)
(('with', 'iphone'), 1)

I would get the same result if I wrote the same for finder1. So, at first glance the filter doesn't work. However, see how it has worked: the trick is to use score_ngrams.

If I use score_ngrams on finder, it would be:

finder.score_ngrams(bigram_measures.pmi)

and the output would be:

[(('am', 'in'), 5.285402218862249), (('am', 'so'), 5.285402218862249), (('better', 'apple'), 5.285402218862249), (('better', 'than'), 5.285402218862249), (('can', 'beat'), 5.285402218862249), (('can', 'never'), 5.285402218862249), (('cases.', 'can'), 5.285402218862249), (('in', 'with'), 5.285402218862249), (('never', 'beat'), 5.285402218862249), (('so', 'in'), 5.285402218862249), (('than', 'apple'), 5.285402218862249), (('sucks.', 'really'), 4.285402218862249), (('is', 'great'), 3.7004397181410926), (('I', 'am'), 3.7004397181410926), (('I', 'so'), 3.7004397181410926), (('cases.', 'samsung'), 3.7004397181410926), (('in', 'love'), 3.7004397181410926), (('is', 'better'), 3.7004397181410926), (('is', 'than'), 3.7004397181410926), (('love', 'cases.'), 3.7004397181410926), (('love', 'with'), 3.7004397181410926), (('samsung', 'better'), 3.7004397181410926), (('samsung', 'can'), 3.7004397181410926), (('samsung', 'never'), 3.7004397181410926), (('so', 'love'), 3.7004397181410926), (('sucks.', 'I'), 3.7004397181410926), (('samsung', 'is'), 3.115477217419936), (('.', 'am'), 2.9634741239748865), (('.', 'sucks.'), 2.9634741239748865), (('beat', '.'), 2.9634741239748865), (('with', '.'), 2.9634741239748865), (('.', 'is'), 2.963474123974886), (('great', '.'), 2.963474123974886), (('love', 'iphone'), 2.7004397181410926), (('I', 'really'), 2.7004397181410926), (('beat', 'iphone'), 2.7004397181410926), (('great', 'samsung'), 2.7004397181410926), (('iphone', 'cases.'), 2.7004397181410926), (('iphone', 'sucks.'), 2.7004397181410926), (('never', 'iphone'), 2.7004397181410926), (('really', 'love'), 2.7004397181410926), (('samsung', 'great'), 2.7004397181410926), (('with', 'iphone'), 2.7004397181410926), (('.', 'samsung'), 2.37851162325373), (('is', '.'), 2.37851162325373), (('iphone', 'I'), 2.1154772174199366), (('iphone', 'samsung'), 2.1154772174199366), (('I', 'love'), 2.115477217419936), (('iphone', '.'), 1.963474123974886), (('great', 'iphone'), 1.7004397181410922), (('iphone', 'great'), 1.7004397181410922), (('really', 'iphone'), 1.7004397181410922), (('.', 'iphone'), 1.37851162325373), (('.', 'I'), 1.37851162325373), (('love', '.'), 1.37851162325373), (('I', 'iphone'), 1.1154772174199366), (('iphone', 'is'), 1.1154772174199366)]

Now notice what happens when I compute the same for finder1 which was filtered to a frequency of 2:

finder1.score_ngrams(bigram_measures.pmi)

And the output:

[(('is', 'great'), 3.7004397181410926), (('samsung', 'is'), 3.115477217419936), (('.', 'is'), 2.963474123974886), (('great', '.'), 2.963474123974886), (('love', 'iphone'), 2.7004397181410926), (('.', 'samsung'), 2.37851162325373), (('is', '.'), 2.37851162325373), (('iphone', 'I'), 2.1154772174199366), (('iphone', 'samsung'), 2.1154772174199366), (('iphone', '.'), 1.963474123974886), (('.', 'iphone'), 1.37851162325373)]

Notice that all the collocations with a frequency of less than 2 are absent from this list; and it is exactly the result you were looking for. So the filter has worked. Also, the documentation gives a minimal hint about this issue.
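
To tie this back to the original question, here is a minimal sketch (my own addition, assuming tweet_phrases is the flat list of words built in your get_list_phrases function) that keeps the frequency filter and actually uses the value returned by nbest or score_ngrams instead of printing ngram_fd:

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tweet_phrases, window_size=13)

# Keep only bigrams that occur at least 10 times.
finder.apply_freq_filter(10)

# nbest returns the ranked list; capture its return value.
top_20 = finder.nbest(bigram_measures.pmi, 20)

# score_ngrams gives (bigram, pmi_score) pairs if you also want the scores.
for bigram, score in finder.score_ngrams(bigram_measures.pmi)[:20]:
    print(bigram, score)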

I hope this has answered your question. Otherwise, please let me know.

Disclaimer: If you are primarily dealing with tweets, a window size of 13 is way too big. If you noticed, in my sample corpus the sample tweets are so small that applying a window size of 13 would cause irrelevant collocations to be found.
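
As a side note (my own suggestion, not part of the original answer): if you only care about adjacent word pairs, you can drop the window_size argument entirely, since from_words defaults to a window of 2, e.g. for the sample corpus above:

finder = BigramCollocationFinder.from_words(tweet_phrases.split())  # default window_size=2: only adjacent pairs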

