Kneser-Ney smoothing of trigrams using Python NLTK


Problem description

I'm trying to smooth a set of n-gram probabilities with Kneser-Ney smoothing using the Python NLTK. Unfortunately, the documentation on this is rather sparse.

What I'm trying to do is this: I parse a text into a list of trigram tuples. From this list I create a FreqDist and then use that FreqDist to calculate a KN-smoothed distribution.

I'm pretty sure, though, that the result is totally wrong. When I sum up the individual probabilities I get something way beyond 1. Take this code example:

import nltk

# NB: nltk.trigrams() iterates over its argument, so passing a raw
# string like this yields *character* trigrams rather than word trigrams.
ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!")

# Count the trigrams and build a Kneser-Ney smoothed distribution from them.
freq_dist = nltk.FreqDist(ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

# Sum the smoothed probabilities over all samples in the distribution.
prob_sum = 0
for i in kneser_ney.samples():
    prob_sum += kneser_ney.prob(i)
print(prob_sum)

The output is "41.51696428571428". Depending on the corpus size, this value grows arbitrarily large, which means that whatever prob() returns, it is anything but a probability distribution.

Looking at the NLTK code, I would say that the implementation is questionable. Maybe I just don't understand how the code is supposed to be used. In that case, could you give me a hint? Otherwise: do you know of any working Python implementation? I don't really want to implement it myself.

Recommended answer

Kneser-Ney (also have a look at Goodman and Chen for a great survey of different smoothing techniques) is a quite complicated smoothing method that only a few packages I am aware of have got right. I am not aware of any Python implementation, but you can definitely try SRILM if you just need probabilities, etc.
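
For what it's worth, newer NLTK releases ship an nltk.lm package (added around NLTK 3.4, well after this question was asked) whose KneserNeyInterpolated model behaves like a proper conditional distribution. Below is a minimal sketch under that assumption; the toy sentences are just the question's text split into word lists:

from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

sentences = [
    "what a piece of work is man how noble in reason".split(),
    "how infinite in faculty in form and moving how express and admirable".split(),
]

# Build padded unigram-to-trigram training data plus the vocabulary.
train, vocab = padded_everygram_pipeline(3, sentences)

lm = KneserNeyInterpolated(order=3)
lm.fit(train, vocab)

# score() returns the conditional probability P(word | context).
print(lm.score("piece", ["what", "a"]))

# For a fixed context, the conditional probabilities over the whole
# vocabulary should sum to approximately 1.
print(sum(lm.score(w, ["what", "a"]) for w in lm.vocab))

(On the non-Python side, SRILM's ngram-count tool has a -kndiscount option for Kneser-Ney discounting.)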

  • There is a good chance that your sample has words that did not occur in the training data (a.k.a. out-of-vocabulary (OOV) words), which, if not handled properly, can mess up the probabilities you get. Perhaps this is what causes the outrageously large and invalid probabilities? One common workaround is sketched below.
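
A hedged sketch of that workaround: map rare training words to an <UNK> token before counting, so unseen test words can be mapped to <UNK> as well. The helper name and threshold here are illustrative, not part of NLTK:

from collections import Counter

def mask_rare_words(tokens, min_count=2, unk="<UNK>"):
    # Illustrative helper: words seen fewer than min_count times become <UNK>.
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else unk for t in tokens]

tokens = "how noble in reason how infinite in faculty".split()
print(mask_rare_words(tokens))
# ['how', '<UNK>', 'in', '<UNK>', 'how', '<UNK>', 'in', '<UNK>']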

