Combining a Tokenizer into a Grammar and Parser with NLTK

Problem description

I am making my way through the NLTK book and I can't seem to do something that would appear to be a natural first step for building a decent grammar.

My goal is to build a grammar for a particular text corpus.

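(Initial question: should I try to write a grammar from scratch, or should I start from a predefined grammar? If I should start from an existing one, which is a good one to start with for English?)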

Suppose I have the following simple grammar:

import nltk

# nltk.parse_cfg was removed in NLTK 3; nltk.CFG.fromstring replaces it.
simple_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP
VP -> V NP | VP PP
Det -> 'a' | 'A'
N -> 'car' | 'door'
V -> 'has'
P -> 'in' | 'for'
""")

This grammar can parse a very simple sentence, such as:

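parser = nltk.ChartParser(simple_grammar)
# The parser expects a list of tokens, not a raw string.
# In NLTK 3, parse() returns an iterator over trees (nbest_parse was removed).
trees = list(parser.parse("A car has a door".split()))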

Now I want to extend this grammar to handle sentences with other nouns and verbs. How do I add those nouns and verbs to my grammar without manually defining them in the grammar?

For example, suppose I want to be able to parse the sentence "A car has wheels". I know that the supplied tokenizers can magically figure out which words are verbs/nouns, etc. How can I use the output of the tokenizer to tell the grammar that "wheels" is a noun?

Recommended answer

You could run a POS tagger over your text and then adapt your grammar to work on POS tags instead of words.

> import nltk
> text = nltk.word_tokenize("A car has a door")
['A', 'car', 'has', 'a', 'door']

> tagged_text = nltk.pos_tag(text)
[('A', 'DT'), ('car', 'NN'), ('has', 'VBZ'), ('a', 'DT'), ('door', 'NN')]

> pos_tags = [pos for (token,pos) in nltk.pos_tag(text)]
['DT', 'NN', 'VBZ', 'DT', 'NN']

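> # nltk.CFG.fromstring replaces nltk.parse_cfg, which was removed in NLTK 3.
> # The terminals are now Penn Treebank tags; prepositions are tagged 'IN'.
> simple_grammar = nltk.CFG.fromstring("""
  S -> NP VP
  PP -> P NP
  NP -> Det N | Det N PP
  VP -> V NP | VP PP
  Det -> 'DT'
  N -> 'NN'
  V -> 'VBZ'
  P -> 'IN'
  """)

> parser = nltk.ChartParser(simple_grammar)
> # parse() returns an iterator over trees in NLTK 3, so collect it into a list.
> trees = list(parser.parse(pos_tags))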
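
A complementary way to avoid typing nouns and verbs by hand is to keep the grammar word-based and generate its lexical rules from the tagger's output. The following is only a minimal sketch of that idea, not part of the original answer. The names `TAG_TO_CATEGORY`, `lexical_rules`, and `STRUCTURAL_RULES` are hypothetical, the tagged list is written out by hand as a stand-in for what `nltk.pos_tag` might return, and the extra `NP -> N` rule is an assumption added so that a bare plural such as "wheels" can form a noun phrase.

```python
from collections import defaultdict

# Hypothetical mapping from Penn Treebank tags to the grammar's categories.
TAG_TO_CATEGORY = {
    "DT": "Det",
    "NN": "N", "NNS": "N",
    "VBZ": "V", "VBP": "V", "VBD": "V",
    "IN": "P",
}


def lexical_rules(tagged_tokens, tag_to_category=TAG_TO_CATEGORY):
    """Build rules such as N -> 'car' | 'wheels' from (word, tag) pairs.

    Assumes plain words without embedded quote characters.
    """
    words_by_category = defaultdict(set)
    for word, tag in tagged_tokens:
        category = tag_to_category.get(tag)
        if category is not None:  # skip tags the grammar has no category for
            words_by_category[category].add(word)
    lines = []
    for category in sorted(words_by_category):
        words = sorted(words_by_category[category])
        lines.append(f"{category} -> " + " | ".join(repr(w) for w in words))
    return "\n".join(lines)


# Hand-written stand-in for tagger output; in practice this would come
# from nltk.pos_tag(nltk.word_tokenize(...)) run over the corpus.
tagged = [("A", "DT"), ("car", "NN"), ("has", "VBZ"), ("wheels", "NNS")]

# Structural rules from the question, plus an assumed NP -> N rule
# so that a noun without a determiner is accepted.
STRUCTURAL_RULES = """\
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | N
VP -> V NP | VP PP
"""

grammar_text = STRUCTURAL_RULES + lexical_rules(tagged)
print(grammar_text)
```

The resulting `grammar_text` could then be passed to `nltk.CFG.fromstring` and the parser run on the word tokens themselves, so the parse trees carry the actual words rather than their tags.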
