How to interpret Python NLTK bigram likelihood ratios?

Question

I'm trying to figure out how to properly interpret nltk's "likelihood ratio" given the below code (taken from this question).

import nltk.collocations
import nltk.corpus
import collections

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort each word's list of (second word, score) pairs by descending score.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

prefix_keys['baseball']

which produces the following output:

[('game', 32.11075451975229),
 ('cap', 27.81891372457088),
 ('park', 23.509042621473505),
 ('games', 23.10503351305401),
 ("player's", 16.22787286342467),
 ('rightfully', 16.22787286342467),
[...]

Looking at the docs, it looks like the likelihood ratio printed next to each bigram is from

"使用曼宁和舒兹的似然比对ngram进行评分 5.3.4."

"Scores ngrams using likelihood ratios as in Manning and Schutze 5.3.4."

Referring to this article, which states on pg. 22:

One advantage of likelihood ratios is that they have a clear intuitive interpretation. For example, the bigram powerful computers is e^(.5*82.96) = 1.3*10^18 times more likely under the hypothesis that computers is more likely to follow powerful than its base rate of occurrence would suggest. This number is easier to interpret than the scores of the t test or the χ² test which we have to look up in a table.

What I'm confused about is what would be the "base rate of occurrence" in the event that I'm using the nltk code noted above with my own data. Would it be safe to say, for example, that "game" is 32 times more likely to appear next to "baseball" in the current dataset than in the average use of the standard English language? Or is it that "game" is more likely to appear next to "baseball" than other words appearing next to "baseball" within the same set of data?

Any help/guidance towards a clearer interpretation or example is much appreciated!

Answer

nltk does not have a universal corpus of English language usage from which to model the probability of 'game' following 'baseball'.

The likelihood scores reflect the likelihood, within the corpus, of each of those result grams being preceded by the word 'baseball'.

The base rate of occurrence describes how often the word game occurs in the corpus overall, regardless of what precedes it. The likelihood ratio asks whether game follows baseball more often than that base rate alone would predict.
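
A minimal sketch of that distinction, using plain frequency counts over the same Brown corpus (the variable names are my own illustration, not part of the original code):

import nltk
from nltk.corpus import brown

words = brown.words()
unigram_fd = nltk.FreqDist(words)
bigram_fd = nltk.FreqDist(nltk.bigrams(words))

# Base rate: how often 'game' occurs anywhere in the corpus.
p_game = unigram_fd['game'] / len(words)

# Conditional rate: how often 'game' occurs immediately after 'baseball'.
p_game_after_baseball = bigram_fd[('baseball', 'game')] / unigram_fd['baseball']

# The likelihood ratio is large when the conditional rate
# is far above the base rate.
print(p_game, p_game_after_baseball)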

nltk.corpus.brown 

is a built-in corpus, or set of observations, and the predictive power of any probability-based model is entirely defined by the observations used to construct or train it.

UPDATE in response to OP comment:

Reading that as "32% of 'game' occurrences are preceded by 'baseball'" is slightly misleading: the likelihood score does not directly model a frequency distribution of the bigram.

nltk.collocations.BigramAssocMeasures().raw_freq

models raw frequency. Raw frequency, like the t test, is not well suited to sparse data such as bigrams, hence the provision of the likelihood ratio.
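
To see the difference, reusing finder and bgm from the question's code (a sketch; the exact top lists depend on the corpus):

# raw_freq scores a bigram as count(bigram) / total_bigrams, so the top
# of the ranking is dominated by frequent function-word pairs.
print(finder.nbest(bgm.raw_freq, 5))

# likelihood_ratio also weighs each word's individual frequency, so
# genuine collocations rank much higher.
print(finder.nbest(bgm.likelihood_ratio, 5))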

The likelihood ratio as calculated by Manning and Schutze is not equivalent to frequency.
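
One way to see this is to score two hypothetical bigrams that occur equally often but whose component words have very different marginal counts. The counts below are made up for illustration; the scoring methods take the bigram count, the two word counts, and the total number of bigrams:

from nltk.collocations import BigramAssocMeasures

bgm = BigramAssocMeasures()

# Same bigram count (10) and corpus size, different word frequencies:
rare = bgm.likelihood_ratio(10, (20, 20), 1000000)
common = bgm.likelihood_ratio(10, (50000, 50000), 1000000)

# The rare-word pair scores far higher: ten co-occurrences of two rare
# words are much stronger evidence of association than ten co-occurrences
# of two very common words.
print(rare, common)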

https://nlp.stanford.edu/fsnlp/promo/colloc.pdf

Section 5.3.4 describes in detail how the calculation is done.

They take into account frequency of word one in the document, frequency of word two in the document, and frequency of the bigram in the document in a manner that is well-suited to sparse matrices like corpus matrices.
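
For reference, a standalone sketch of that calculation, following the binomial formulation in section 5.3.4 (my own transcription, illustrative rather than NLTK's exact implementation):

import math

def log_l(k, n, x):
    # Log binomial likelihood of k successes in n trials with probability x
    # (the binomial coefficient is omitted because it cancels in the ratio).
    return k * math.log(x) + (n - k) * math.log(1 - x)

def likelihood_ratio(c1, c2, c12, n):
    """-2 log lambda for a bigram (word1, word2).

    c1, c2: corpus counts of word1 and word2; c12: count of the bigram;
    n: total number of bigrams. Assumes no probability is exactly 0 or 1.
    """
    p = c2 / n                   # independence: P(w2 | w1) = P(w2 | not w1)
    p1 = c12 / c1                # dependence: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)   # dependence: P(w2 | not w1)
    return -2 * (log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
                 - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2))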

If you are familiar with the TF-IDF vectorization method, this ratio aims at something similar in terms of normalizing away noisy features.

The score can be arbitrarily large. The relative differences between scores reflect the inputs just described (the corpus frequencies of word 1, word 2, and the bigram word1 word2).

This chart is the most intuitive piece of their explanation, unless you're a statistician:

[missing image: the bigram table from the linked PDF, with the likelihood score in its leftmost column]

The likelihood score is the value in the leftmost column.
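
Tying this back to the passage quoted in the question: if the printed score is that -2 log lambda statistic, it can be converted into the same kind of "times more likely" factor (a sketch under that assumption):

import math

# Score printed by nltk for ('baseball', 'game') in the output above.
score = 32.11075451975229

# Under Manning and Schutze's interpretation, the dependence hypothesis is
# e^(score / 2) times more likely than independence would suggest.
print(math.exp(score / 2))  # roughly 9.4e6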
