Generating a PCFG from the Universal tagset


Question

I am trying to build a PCFG using the POS tags obtained from the code below:

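from nltk.corpus import treebank

# NLTK's treebank corpus data must be downloaded for this to run
corpus = treebank.tagged_sents(tagset='universal')
tags = set()

# Collect every distinct POS tag that occurs in the corpus
for sent in corpus:
    for word, tag in sent:
        tags.add(tag)

tags = list(tags)
print(tags)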

This gives:

['ADV', 'NOUN', 'ADP', 'PRON', 'DET', '.', 'PRT', 'NUM', 'X', 'CONJ', 'ADJ', 'VERB']

I need to generate a PCFG using the POS tags above. But when I try to construct a grammar using the rule

nltk.grammar.PCFG.fromstring("""T5 -> . NT6 [0.136235]""")

it produces:

ValueError: Unable to parse line 1: T5 -> . NT6 [0.136235]
Expected a nonterminal, found: . NT6 [0.136235]

I assume the exception indicates that "." is not a valid nonterminal in nltk.grammar.PCFG. But I am wondering if there is a neat way to fix this.
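For reference, the failure can be reproduced without the corpus. The following is a minimal sketch, assuming only that NLTK is installed; the variable names are illustrative. It also shows that the same rule parses when "." is quoted, although quoting turns "." into a terminal rather than a nonterminal, which changes what the rule means.

```python
from nltk.grammar import PCFG

rule = "T5 -> . NT6 [0.136235]"

# A bare "." is not accepted as a nonterminal name by the string parser.
try:
    PCFG.fromstring(rule)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)

# Quoting "." makes the line parse, but "." is then a terminal symbol.
# The probability is set to 1.0 because a PCFG requires the probabilities
# of all productions with the same left-hand side to sum to 1.
quoted = PCFG.fromstring("T5 -> '.' NT6 [1.0]")
print(quoted.productions())
```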

Related

The question "nltk cant interpret grammar category PRP$ output by stanford parser" gives a nice fix for adding '$' from the Treebank tagset to the grammar. But the Treebank POS tagset also contains a pair of single quotes ('') as a POS tag, which is not a valid symbol either.

Is there a neat workaround for this problem that does not require adding each special character to the grammar?
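One possible workaround that stays with fromstring is to rename the tags before writing the rules, so that every symbol is a plain identifier. This is only a sketch: safe_symbol is a hypothetical helper, not part of NLTK, and the "_x..._" escape format is an arbitrary choice that assumes no real tag already contains such a sequence.

```python
import re

def safe_symbol(tag):
    """Replace every character outside [A-Za-z0-9_] with _x<HEX>_ (hypothetical helper)."""
    return re.sub(
        r"[^A-Za-z0-9_]",
        lambda m: "_x{:02X}_".format(ord(m.group())),
        tag,
    )

# A few tags from the universal and Treebank tagsets mentioned above
tags = ["ADV", "NOUN", ".", "''", "PRP$"]
renamed = {tag: safe_symbol(tag) for tag in tags}

# Build the rule text from the renamed symbol instead of the raw tag
rule = "T5 -> {} NT6 [0.136235]".format(renamed["."])
print(renamed)
print(rule)
```

Keeping the renamed dictionary around makes it possible to map the symbols back to the original tags when reading parse results.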

Recommended answer

I found the answer to this question. Instead of using the fromstring method, generate the PCFG object by passing an nltk.Nonterminal start symbol and a list of nltk.ProbabilisticProduction objects to the constructor, as below:

from nltk import ProbabilisticProduction
from nltk.grammar import PCFG
from nltk import Nonterminal as NT

# The rule from the question, with '.' wrapped in a Nonterminal object
p1 = ProbabilisticProduction(NT('T5'), [NT('.'), NT('NT6')], prob=1)

# Adding a terminal production
p2 = ProbabilisticProduction(NT('NT6'), ['terminal'], prob=1)

start = NT('Q0')  # Q0 is the start symbol for my grammar
grammar = PCFG(start, [p1, p2])  # Takes a list of ProbabilisticProductions
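If the rule probabilities are to be estimated from counts rather than written by hand, nltk.grammar.induce_pcfg builds the PCFG from plain Production objects, which also avoids fromstring. A minimal sketch, where the list of observed rules is made up for illustration and T5 is assumed to be the start symbol:

```python
from nltk.grammar import Nonterminal, Production, induce_pcfg

T5, NT6 = Nonterminal("T5"), Nonterminal("NT6")
dot = Nonterminal(".")  # special characters are fine when built directly

# Hypothetical rule applications; repeated entries act as counts.
observed = [
    Production(T5, [dot, NT6]),
    Production(T5, [dot, NT6]),
    Production(T5, [Nonterminal("NOUN")]),
    Production(NT6, ["terminal"]),
]

# Each probability is count(rule) / count(rules with the same left-hand side).
grammar = induce_pcfg(T5, observed)

t5_probs = {prod.rhs(): prod.prob() for prod in grammar.productions(lhs=T5)}
for prod in grammar.productions():
    print(prod)
```

With these counts, the rule with "." on the right-hand side is seen twice out of three T5 expansions, so it receives a probability of 2/3.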
