Probability tree for sentences in nltk employing both lookahead and lookback dependencies


Problem description

Does nltk or any other NLP tool allow constructing probability trees based on input sentences, thus storing the language model of the input text in a dictionary tree? The following example gives the rough idea, but I need the same functionality such that a word Wt is not just probabilistically modelled on past input words (history) Wt-n but also on lookahead words such as Wt+m. The lookback and lookahead word counts should also be 2 or more, i.e. bigrams or more. Are there any other libraries in Python which achieve this?

from collections import defaultdict
import nltk
import math

ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    # list() is needed in Python 3, where map() returns an iterator that cannot be sliced
    tokens = list(map(str.lower, nltk.word_tokenize(sentence)))
    for token, next_token in zip(tokens, tokens[1:]):   # count forward bigrams
        ngram[token][next_token] += 1
for token in ngram:
    # convert counts to log10 conditional probabilities P(next | token)
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}

The solution requires both lookahead and lookback, and a specially subclassed dictionary may help in solving this problem. Pointers to relevant resources that talk about implementing such a system would also be welcome. nltk.models seemed to be doing something similar but is no longer available. Are there any existing design patterns in NLP which implement this idea? Skip-gram based models are similar to this idea too, but I feel this should already have been implemented somewhere.

Solution

If I understand your question correctly, you are looking for a way to predict the probability of a word given its surrounding context (not just the backward context but also the forward context). One quick hack for your purpose is to train two different language models, one from right to left and the other from left to right; the probability of a word given its context is then the normalized sum of its probabilities under the forward and backward contexts.

Extending your code:

from collections import defaultdict
import nltk
import numpy as np


ngram = defaultdict(lambda: defaultdict(int))
ngram_rev = defaultdict(lambda: defaultdict(int))  # reversed n-grams
corpus = "The cat is cute. He jumps and he is happy."

for sentence in nltk.sent_tokenize(corpus):
    # list() is needed in Python 3, where map() returns an iterator that cannot be sliced
    tokens = list(map(str.lower, nltk.word_tokenize(sentence)))
    for token, next_token in zip(tokens, tokens[1:]):   # forward bigram counts
        ngram[token][next_token] += 1
    for token, rev_token in zip(tokens[1:], tokens):    # backward bigram counts
        ngram_rev[token][rev_token] += 1

# normalize the two models separately: a token may occur as a context in one
# model but not the other (e.g. sentence-initial or sentence-final tokens)
for token in ngram:
    total = np.log(sum(ngram[token].values()))
    ngram[token] = {nxt: np.log(v) - total
                    for nxt, v in ngram[token].items()}
for token in ngram_rev:
    total_rev = np.log(sum(ngram_rev[token].values()))
    ngram_rev[token] = {prv: np.log(v) - total_rev
                        for prv, v in ngram_rev[token].items()}

Now ngram and ngram_rev hold the forward and backward contexts respectively.
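As a rough illustration of the normalized-sum idea, here is a minimal sketch; the helper name context_prob and the 0.5/0.5 interpolation weights are my own assumptions rather than anything prescribed above.

import numpy as np

def context_prob(word, left_word, right_word, w_fwd=0.5, w_bwd=0.5):
    # log P(word | left_word) from the forward model
    log_fwd = ngram.get(left_word, {}).get(word, -np.inf)
    # log P(word | right_word) from the reversed (right-to-left) model
    log_bwd = ngram_rev.get(right_word, {}).get(word, -np.inf)
    # interpolate the two probabilities; an unseen context contributes 0
    return w_fwd * np.exp(log_fwd) + w_bwd * np.exp(log_bwd)

# e.g. how plausible is "is" between "he" and "happy"?
print(context_prob("is", "he", "happy"))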

You should also account for smoothing. That is, if a given phrase is never seen in your training corpus, you will simply get zero probabilities. To avoid that, there are many smoothing techniques, the simplest of which is add-one (Laplace) smoothing.
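For instance, a minimal add-one sketch applied to the raw bigram counts (i.e. the defaultdicts above before the log-normalization step) could look like this; the helper name and the vocab_size parameter are illustrative assumptions.

import numpy as np

def add_one_log_prob(counts, vocab_size, context, word):
    # counts[context][word] holds raw bigram counts, as built above
    # before the counts are converted to log probabilities
    c = counts.get(context, {}).get(word, 0)
    total = sum(counts.get(context, {}).values())
    # every (context, word) pair receives one extra pseudo-count
    return np.log((c + 1) / (total + vocab_size))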
