Probability tree for sentences in nltk employing both lookahead and lookback dependencies

Question

Does nltk or any other NLP tool allow constructing probability trees based on input sentences, thus storing the language model of the input text in a dictionary tree? The following example gives the rough idea, but I need the same functionality such that a word Wt is not just probabilistically modelled on past input words (history) Wt-n, but also on lookahead words such as Wt+m. The lookback and lookahead word counts should also be 2 or more, i.e. bigrams or more. Are there any other libraries in Python which achieve this?

from collections import defaultdict
import math
import nltk

# Forward bigram model: ngram[token][next_token] -> log10 P(next_token | token)
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    # list() is needed in Python 3, where map() returns a one-shot iterator
    tokens = list(map(str.lower, nltk.word_tokenize(sentence)))
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1
# Convert the raw counts to log10 probabilities
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}
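
For a quick sanity check (not part of the original question), the resulting table can be queried directly; the stored values are log10 probabilities, so exponentiating recovers the conditional probabilities:

print({nxt: 10 ** logp for nxt, logp in ngram["is"].items()})
# for the toy corpus above this prints roughly {'cute': 0.5, 'happy': 0.5}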

The solution requires both lookahead and lookback, and a specially subclassed dictionary may help in solving this problem. Pointers to relevant resources that talk about implementing such a system would also be appreciated. nltk.models seemed to be doing something similar but is no longer available. Are there any existing design patterns in NLP which implement this idea? Skip-gram based models are similar to this idea too, but I feel this should have already been implemented somewhere.

Recommended answer

If I understand your question correctly, you are looking for a way to predict the probability of a word given its surrounding context (not just the backward context but also the forward context). One quick hack for your purpose is to train two different language models, one from right to left and the other from left to right; the probability of a word given its context would then be the normalized sum of its forward-context and backward-context probabilities.

Extending your code:

from collections import defaultdict
import nltk
import numpy as np


ngram = defaultdict(lambda: defaultdict(int))
ngram_rev = defaultdict(lambda: defaultdict(int))  # reversed n-grams
corpus = "The cat is cute. He jumps and he is happy."

for sentence in nltk.sent_tokenize(corpus):
    # list() is needed in Python 3, where map() returns a one-shot iterator
    tokens = list(map(str.lower, nltk.word_tokenize(sentence)))
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1
    for token, rev_token in zip(tokens[1:], tokens):
        ngram_rev[token][rev_token] += 1

# Normalize each direction separately: the two dictionaries do not share
# exactly the same keys (e.g. sentence-initial vs. sentence-final tokens).
for token in ngram:
    total = np.log(sum(ngram[token].values()))
    ngram[token] = {nxt: np.log(v) - total
                    for nxt, v in ngram[token].items()}
for token in ngram_rev:
    total_rev = np.log(sum(ngram_rev[token].values()))
    ngram_rev[token] = {prv: np.log(v) - total_rev
                        for prv, v in ngram_rev[token].items()}

Now the context is in both ngram and ngram_rev, which hold the forward and backward contexts respectively.
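
The combination step itself is not spelled out in the answer; a minimal sketch (my own, using a hypothetical score_word helper and a simple average of the two log probabilities as a stand-in for the "normalized sum") might look like this:

import numpy as np

def score_word(word, prev_word, next_word):
    """Hypothetical helper: score `word` given its left and right neighbours
    by averaging the forward and backward bigram log probabilities."""
    fwd = ngram.get(prev_word, {}).get(word, -np.inf)      # log P(word | prev_word)
    bwd = ngram_rev.get(next_word, {}).get(word, -np.inf)  # log P(word | next_word)
    return (fwd + bwd) / 2.0

# e.g. score "is" in the context "he _ happy"
print(score_word("is", "he", "happy"))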

You should also account for smoothing. That is, if a given phrase is not seen in your training corpus, you would just get zero probability. To avoid that, there are many smoothing techniques, the simplest of which is add-one (Laplace) smoothing.
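
As one concrete illustration (my own sketch, not from the original answer), add-one smoothing for the forward bigram model can be applied at the counting stage; the raw bigram counts are rebuilt here because the dictionaries above were already overwritten with log probabilities:

from collections import defaultdict
import nltk
import numpy as np

corpus = "The cat is cute. He jumps and he is happy."

# Recount raw forward bigrams and collect the vocabulary.
counts = defaultdict(lambda: defaultdict(int))
vocab = set()
for sentence in nltk.sent_tokenize(corpus):
    tokens = list(map(str.lower, nltk.word_tokenize(sentence)))
    vocab.update(tokens)
    for token, next_token in zip(tokens, tokens[1:]):
        counts[token][next_token] += 1

# Add-one (Laplace) smoothing: every possible bigram gets a pseudo-count of 1,
# so unseen pairs receive a small non-zero probability instead of log(0).
smoothed = {}
for token in vocab:
    total = sum(counts[token].values()) + len(vocab)
    smoothed[token] = {nxt: np.log(counts[token].get(nxt, 0) + 1) - np.log(total)
                       for nxt in vocab}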
