从cyk提取概率和最有可能解析树 [英] Extract probabilities and most likely parse tree from cyk

查看:123
本文介绍了从cyk提取概率和最有可能解析树的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

为了理解cyk算法,我通过以下示例进行了研究:

In order to understand cyk algorithm I've worked through example on : https://www.youtube.com/watch?v=VTH1k-xiswM&feature=youtu.be .

其结果是:

如何提取与每个解析相关的概率并提取最可能的解析树?

How do I extract the probabilities associated with each parse and extract the most likely parse tree ?

推荐答案

对于PCFG,这是两个不同的问题:

These are two distinct problems for PCFG:

  • 识别:该句子是否属于CFG生成的语言? (输出:是或否)
  • 分析:这句话的最高评分树是什么? (输出:解析树)
  • recognition: does the sentence belong to the language generated by the CFG? (output: yes or no)
  • parsing: what is the highest scoring tree for this sentence? (output: parse tree)

问题中链接的CKY算法视频仅处理识别问题.要同时执行解析问题, 我们要 (i)维持每个解析项目的分数,以及 (ii)跟踪层次关系(例如,如果使用规则S-> NP VP:我们必须跟踪使用哪个NP和哪个VP来预测S).

The CKY algorithm video linked in the question only deals with the recognition problem. To perform the parsing problem simultaneously, we need to (i) maintain the score of each parsing item and (ii) keep track of the hierarchical relationships (e.g. if we use the rule S -> NP VP: we must keep track of which NP and which VP are used to predict S).

符号:

  • 解析项目 [X, i, j]: s表示存在一个标记为X的节点,其跨度为令牌 i (包括)到 j (排除)得分 s . 分数是植根于X的子树的对数概率.
  • 句子是单词w_1 w_2 ... w_n的序列.
  • A parsing item [X, i, j]: s means that there is a node labelled X spanning token i (included) to j (excluded) with score s. The score is the log probability of the subtree rooted in X.
  • The sentence is a sequence of words w_1 w_2 ... w_n.

从抽象的角度来看,PCFG解析可以表述为一组推导规则:

At an abstract level, PCFG parsing can be formulated as a set of deduction rules:

  1. 扫描规则(读取令牌)

  1. Scan Rule (read tokens)

____________________________{precondition: X -> w_i is a grammar rule
[X, i, i+1]: log p(X -> w_i)

光泽度:如果语法中有规则X -> w_i,那么我们可以在图表中添加项目[X, i, i+1].

Gloss: if there is a rule X -> w_i in the grammar, then we can add the item [X, i, i+1] in the chart.

完整规则(自下而上创建新的组成部分)

Complete Rule (create a new constituent bottom up)

[X, i, k]: s1     [Y, k, j]: s2
_____________________________________{precondition: Z -> X Y is a grammar rule
[Z, i, j]: s1 + s2 + log p(Z -> X, Y)

光泽度:如果图表中有两个解析项[X, i, k][Y, k, j],并且在语法中有规则Z -> X Y,那么我们可以在图表中添加[Z, i, j].

Gloss: if there are 2 parsing items [X, i, k] and [Y, k, j] in the chart, and a rule Z -> X Y in the grammar, then we can add [Z, i, j] to the chart.

加权分析的目的是推导分数最高的s(S:公理,n:句子的长度).

The goal of weighted parsing is to deduce the parsing item [S, 0, n]:s (S: axiom, n: length of sentence) with the highest score s.

整个算法的伪代码

# The chart stores parsing items and their scores
chart[beginning(int)][end(int)][NonTerminal][score(float)]

# the backtrack table is used to recover the parse tree at the end
backtrack[beginning][end][NonTerminal][item_left, item_right]

# insert a new item in the chart
# for a given span (i, j) and nonterminal X, we only need to
# keep the single best scoring item.
def insert(X, i, j, score, Y, Z, k):
    if X not in chart[i][j] or chart[i][j][X] < score             
        chart[i][j][X] <- score
        backtrack[i][j][X] <- (Y, i, k), (Z, k, j)

n <- length of sentence

for i in range(0, n):
    # apply scan rule
    insert(X, i, i+1, log p(X -> w_i)) for each grammar rule X -> w_i

for span_length in range(2, n):
    for beginning in range(0, n - span_length):
        end <- beginning + span_length

        for k in range(beginning+1, end -1):
            # apply completion rules
            for each grammar rule X -> Y Z such that 
                * Y is in chart[beginning][k]
                * Z is in chart[k][end]

                score_left <- chart[beginning][k][Y]
                score_right <- chart[k][end][Z]

                insert(X, beginning, end, log p(X -> Y Z) + score_left + score_right)

if there is S (axiom) in chart[0][n], then parsing is successful
    the score (log probability) of the best tree is chart[0][n][S]
    the best tree can be recovered recursively by following pointers from backtrack[0][n][S]
else:
    parsing failure, the sentence does not belong to the language
    generated by the grammar

一些注意事项:

这篇关于从cyk提取概率和最有可能解析树的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆