NLTK Perplexity measure inversion


Problem description

I have been given a train text and a test text. What I want to do is to train a language model on the training data and then calculate the perplexity of the test data with it.

This is my code:

import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends

from nltk import word_tokenize, sent_tokenize 

fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
        textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(text)]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace

n = 1
padded_bigrams = list(pad_both_ends(word_tokenize(textTest), n=1))
trainTest = everygrams(padded_bigrams, min_len = n , max_len=n);

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)


model = Laplace(n) 
model.fit(train_data, padded_sents)

print(model.perplexity(trainTest)) 

When I run this code with n=1 (unigrams), I get "1068.332393940235". With n=2 (bigrams) I get "1644.3441077259993", and with trigrams I get 2552.2085752565313.

What is the problem here?

Answer

The way you are creating the test data is wrong: the training data is lowercased, but the test data is not converted to lowercase, and the start and end tokens are missing from the test data. Try this:

import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
from nltk import word_tokenize, sent_tokenize 

"""
fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
        textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()
"""
textTest = "This is an ant. This is a cat"
text = "This is an orange. This is a mango"

n = 2

# Tokenize and lowercase the training text, then build the padded
# everygram training data and the padded vocabulary stream.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(text)]
train_data, padded_vocab = padded_everygram_pipeline(n, tokenized_text)

# Preprocess the test text exactly the same way (lowercasing, sentence
# splitting, padding) so train and test n-grams are comparable.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(textTest)]
test_data, _ = padded_everygram_pipeline(n, tokenized_text)

model = Laplace(n) 
model.fit(train_data, padded_vocab)

# Average the per-sentence perplexities over the test sentences.
s = 0
for i, test in enumerate(test_data):
    p = model.perplexity(test)
    s += p

print("Perplexity: {0}".format(s / (i + 1)))
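
If you would rather report a single perplexity over all test n-grams instead of an average of per-sentence perplexities, a minimal sketch (assuming the same model, n, and test-side tokenized_text from the code above) chains the padded everygrams of every test sentence into one sequence:

from itertools import chain
from nltk.lm.preprocessing import padded_everygrams

# Assumption: `tokenized_text` still holds the lowercased test sentences
# and `model` has already been fitted on the training data.
test_ngrams = list(chain.from_iterable(
    padded_everygrams(n, sent) for sent in tokenized_text))
print("Perplexity over all test n-grams: {0}".format(model.perplexity(test_ngrams)))

Averaging per-sentence perplexities weights every sentence equally regardless of its length, while scoring all n-grams at once weights longer sentences more heavily; either is reasonable, so pick whichever matches how you want to report the result.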

