Ngram model and perplexity in NLTK


Question

To put my question in context, I would like to train and test/compare several (neural) language models. In order to focus on the models rather than data preparation, I chose to use the Brown corpus from nltk and train the Ngram model provided with nltk as a baseline (to compare the other LMs against).

So my first question is actually about a behaviour of the Ngram model of nltk that I find suspicious. Since the code is rather short I pasted it here:

import nltk

print "... build"
brown = nltk.corpus.brown
corpus = [word.lower() for word in brown.words()]

# Train on 95% of the corpus and test on the rest
spl = 95*len(corpus)/100
train = corpus[:spl]
test = corpus[spl:]

# Remove rare words from the corpus
fdist = nltk.FreqDist(w for w in train)
vocabulary = set(map(lambda x: x[0], filter(lambda x: x[1] >= 5, fdist.iteritems())))

train = map(lambda x: x if x in vocabulary else "*unknown*", train)
test = map(lambda x: x if x in vocabulary else "*unknown*", test)

print "... train"
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2) 
lm = NgramModel(5, train, estimator=estimator)

print "len(corpus) = %s, len(vocabulary) = %s, len(train) = %s, len(test) = %s" % ( len(corpus), len(vocabulary), len(train), len(test) )
print "perplexity(test) =", lm.perplexity(test)

What I find very suspicious is that I get the following results:

... build
... train
len(corpus) = 1161192, len(vocabulary) = 13817, len(train) = 1103132, len(test) = 58060
perplexity(test) = 4.60298447026

With a perplexity of 4.6 it seems Ngram modeling is very good on that corpus. If my interpretation is correct, then the model should be able to guess the correct word in roughly 5 tries on average (although there are 13817 possibilities...). Could you share your experience on whether this perplexity value is plausible (I don't really believe it)? I did not find any complaints about nltk's ngram model on the net (but maybe I am doing it wrong). Do you know of good alternatives to NLTK for Ngram models and computing perplexity?
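
As a sanity check on that interpretation: perplexity is the exponentiated average negative log-probability of the test words, i.e. the inverse geometric mean of the probabilities the model assigns, so a perplexity of about 5 does mean the model is, on average, as uncertain as a uniform choice among roughly 5 words. A minimal plain-Python sketch of that relationship (the probabilities below are made-up placeholders, not values from the run above):

import math

# Hypothetical per-word probabilities a model might assign to a short test sequence.
# Illustrative values only; they are not taken from the NLTK run above.
word_probs = [0.25, 0.2, 0.1, 0.3, 0.25]

# Perplexity = exp of the average negative log-probability,
# i.e. the inverse geometric mean of the assigned probabilities.
cross_entropy = -sum(math.log(p) for p in word_probs) / len(word_probs)
print("perplexity = %.2f" % math.exp(cross_entropy))  # ~4.84: about as uncertain as a 5-way uniform choice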

Thanks!

Answer

You are getting a low perplexity because you are using a 5-gram model. If you used a bigram model instead, your results would be in the more usual range of about 50-1000 (or about 5 to 10 bits per word).

Given your comments, are you using NLTK-3.0alpha? You shouldn't, at least not for language modeling:

https://github.com/nltk/nltk/issues?labels=model

As a matter of fact, the whole model module has been dropped from the NLTK-3.0a4 pre-release until the issues are fixed.
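
For what it's worth, later NLTK releases reintroduced language modeling as the nltk.lm package. Below is a minimal sketch of how the bigram perplexity mentioned above could be checked with it; the choice of Laplace (add-one) smoothing and sentence-level padding is an assumption of this sketch, not something from the original question or answer:

import nltk
from nltk.lm import Laplace  # add-one smoothing, standing in for the Lidstone(0.2) estimator above
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import bigrams

# Sentence-segmented, lowercased Brown corpus (the question flattens it into one word list instead).
sents = [[w.lower() for w in sent] for sent in nltk.corpus.brown.sents()]
spl = 95 * len(sents) // 100
train_sents, test_sents = sents[:spl], sents[spl:]

# Build padded training ngrams plus a vocabulary, then fit an order-2 model.
train_data, vocab = padded_everygram_pipeline(2, train_sents)
lm = Laplace(2)
lm.fit(train_data, vocab)

# Score the held-out sentences as padded bigrams.
test_bigrams = [bg for sent in test_sents for bg in bigrams(pad_both_ends(sent, n=2))]
print("bigram perplexity: %.1f" % lm.perplexity(test_bigrams))

The exact number depends on the smoothing and on how out-of-vocabulary words are handled, but it should land in the 50-1000 range described above rather than anywhere near 4.6.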
