NLTK package to estimate the (unigram) perplexity

Problem Description

I am trying to calculate the perplexity for the data I have. The code I am using is:

import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")

from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm

But I am getting the error message:

File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'

I have already performed Latent Dirichlet Allocation for the data I have and I have generated the unigrams and their respective probabilities (they are normalized as the sum of total probabilities of the data is 1).

My unigrams and their probability looks like:

Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781

This is just a fragment of the unigrams file I have. The same format is followed for thousands more lines. The total probabilities (second column) sum to 1.

I am a budding programmer. This ngram.py belongs to the NLTK package, and I am confused about how to fix it. The sample code I have here is from the NLTK documentation, and I don't know what to do now. Please advise on what I can do. Thanks in advance!

Recommended Answer

Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:
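In LaTeX notation, with w_1 ... w_N the words of the test set and P(w_i) their unigram probabilities, the standard formula is:

$$\mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i)}}$$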

Now you say you have already constructed the unigram model, meaning, for each word you have the relevant probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that would provide the probability of each word in the corpus. You also need to have a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I could adapt it to my solution accordingly.

perplexity = 1
N = 0

for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
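
If your unigram probabilities live in a text file like the fragment shown in the question (one "word probability" pair per line), a minimal sketch for turning it into the unigram dictionary used above could look like the following; the filename unigram_probs.txt and the example test words are assumptions, not part of the original post:

# sketch: build the unigram[word] dictionary from "word probability" lines
unigram = {}
with open("unigram_probs.txt") as f:   # hypothetical path to your unigrams file
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            unigram[parts[0]] = float(parts[1])

testset = "four yellow Sugar".split()  # any list of test words can go here

These two variables can then be fed straight into the loop above.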

Update:

As you asked for a complete working example, here's a very simple one.

Suppose this is our corpus:

corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""

Here's how we construct the unigram model first:

import collections, nltk
# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)

# here you construct the unigram language model
def unigram(tokens):
    # every count starts from the defaultdict's default of 0.01; words never
    # seen in the corpus keep that value, giving the smoothing described below
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        model[f] += 1
    # normalize the counts into probabilities
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word] / N
    return model

Our model here is smoothed. For words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:

#computes perplexity of the unigram model on a testset  
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N)) 
    return perplexity

Now we can test this on two different test sets:

testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"

model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)

You will get the following results:

>>> 
49.09452736318415
99.99999999999997

Note that when dealing with perplexity, we try to reduce it. A language model with lower perplexity on a given test set is preferable to one with higher perplexity. In the first test set, the word Monty is included in the unigram model, so the resulting perplexity is also smaller.
