Efficient Context-Free Grammar parser, preferably Python-friendly

Problem description

I need to parse a small subset of English for one of my projects, described as a context-free grammar with (1-level) feature structures (example), and I need to do it efficiently.

Right now I'm using NLTK's parser, which produces the right output but is very slow. For my grammar of ~450 fairly ambiguous non-lexicon rules and half a million lexical entries, parsing simple sentences can take anywhere from 2 to 30 seconds, depending, it seems, on the number of resulting trees. Lexical entries have little to no effect on performance.
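
For context, a minimal sketch of what such a feature-structure grammar plus chart parser looks like in NLTK; `FeatureGrammar` and `FeatureEarleyChartParser` are real NLTK classes, but the toy grammar and sentence below are illustrative only, not the actual ~450-rule grammar.

```python
# A minimal sketch of a (1-level) feature-structure CFG in NLTK; the toy
# grammar and sentence are illustrative only, not the real 450-rule grammar.
from nltk.grammar import FeatureGrammar
from nltk.parse import FeatureEarleyChartParser

grammar = FeatureGrammar.fromstring("""
% start S
S -> NP[NUM=?n] VP[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
VP[NUM=?n] -> V[NUM=?n] NP
Det[NUM=sg] -> 'a' | 'the'
Det[NUM=pl] -> 'the'
N[NUM=sg] -> 'dog' | 'cat'
N[NUM=pl] -> 'dogs' | 'cats'
V[NUM=sg] -> 'chases'
V[NUM=pl] -> 'chase'
""")

# FeatureEarleyChartParser is one of several feature-aware chart parsers NLTK ships.
parser = FeatureEarleyChartParser(grammar)
for tree in parser.parse("the dog chases a cat".split()):
    print(tree)
```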

Another problem is that loading the (25MB) grammar+lexicon at the beginning can take up to a minute.

From what I can find in the literature, the running time of the algorithms used to parse such a grammar (Earley or CKY) should be linear in the size of the grammar and cubic in the length of the input token list. My experience with NLTK indicates that ambiguity is what hurts performance most, not the absolute size of the grammar.
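
For reference, a sketch of the standard bounds behind that statement, with n the number of input tokens and |G| the grammar size; the second line suggests why ambiguity tends to dominate once all trees are enumerated rather than merely recognized.

```latex
\text{recognition (Earley/CKY):}\quad O\!\left(|G|\cdot n^{3}\right)

\text{parse trees for the maximally ambiguous } S \to S\,S \mid a
\text{ over } n \text{ tokens:}\quad
C_{n-1} \;=\; \frac{1}{n}\binom{2(n-1)}{n-1} \;=\; \Theta\!\left(4^{n}/n^{3/2}\right)
```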

So now I'm looking for a CFG parser to replace NLTK. I've been considering PLY but I can't tell whether it supports feature structures in CFGs, which are required in my case, and the examples I've seen seem to be doing a lot of procedural parsing rather than just specifying a grammar. Can anybody show me an example of PLY both supporting feature structs and using a declarative grammar?
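
For illustration, a minimal, hypothetical sketch of PLY's style (the toy grammar and lexicon are made up; `ply.lex` and `ply.yacc` are the real modules): grammar rules live in the docstrings of `p_*` functions, and anything like a feature structure (here just a NUM agreement value carried on each token) has to be built and checked by hand inside the rule actions. PLY also builds LALR(1) tables, so a heavily ambiguous grammar yields conflict warnings rather than multiple parse trees.

```python
# A minimal, hypothetical sketch of PLY's style (toy grammar and lexicon made
# up): grammar rules are docstrings on p_* functions, and "features" (here a
# single NUM agreement value carried on each token) are handled procedurally.
import ply.lex as lex
import ply.yacc as yacc

# --- lexer: one catch-all word rule plus a dictionary lookup ---
LEXICON = {
    'the': ('DET', None), 'a': ('DET', 'sg'),
    'dog': ('NOUN', 'sg'), 'dogs': ('NOUN', 'pl'),
    'chases': ('VERB', 'sg'), 'chase': ('VERB', 'pl'),
}
tokens = ('WORD', 'DET', 'NOUN', 'VERB')
t_ignore = ' \t'

def t_WORD(t):
    r'[a-zA-Z]+'
    pos, num = LEXICON.get(t.value, ('WORD', None))
    t.type = pos
    t.value = (t.value, num)          # carry the NUM "feature" on the token value
    return t

def t_error(t):
    raise SyntaxError('unexpected character %r' % t.value[0])

# --- parser: the grammar itself is declarative, the feature logic is not ---
def p_s(p):
    'S : NP VP'
    if p[1]['num'] and p[2]['num'] and p[1]['num'] != p[2]['num']:
        raise SyntaxError('subject/verb agreement failure')
    p[0] = ('S', p[1], p[2])

def p_np(p):
    'NP : DET NOUN'
    word, num = p[2]                  # NP inherits NUM from its head noun
    p[0] = {'cat': 'NP', 'num': num, 'words': (p[1][0], word)}

def p_vp(p):
    'VP : VERB NP'
    word, num = p[1]
    p[0] = {'cat': 'VP', 'num': num, 'verb': word, 'obj': p[2]}

def p_error(p):
    raise SyntaxError('parse error at %r' % (p,))

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse('the dog chases a cat', lexer=lexer))
```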

I'm also fine with any other parser that can do what I need efficiently. A Python interface is preferable but not absolutely necessary.

Recommended answer

By all means take a look at Pyparsing. It's the most Pythonic implementation of parsing I've come across, and it's a great design from a purely academic standpoint.
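
To give a flavour of that style, a minimal sketch with a made-up toy grammar (`Group`, `Keyword`, `Word` and `alphas` are real pyparsing names): the grammar is just ordinary Python expressions combined with operators, with no separate grammar file or code-generation step. Being a recursive-descent, PEG-style library, though, pyparsing returns a single parse rather than enumerating the alternative trees of an ambiguous CFG.

```python
# A minimal pyparsing sketch (toy grammar, made up for illustration): the
# grammar is built from ordinary Python objects and operators.
from pyparsing import Group, Keyword, Word, alphas

det = Keyword("the") | Keyword("a")
noun = Word(alphas)
verb = Word(alphas)

np = Group(det + noun)                 # NP -> DET N
vp = Group(verb + np("obj"))           # VP -> V NP
sentence = np("subj") + vp("vp")       # S  -> NP VP

result = sentence.parseString("the dog chases a cat")
print(result.dump())
```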

I used both ANTLR and JavaCC to teach translator and compiler theory at a local university. They're both good and mature, but I wouldn't use them in a Python project.

That said, unlike programming languages, natural languages are much more about the semantics than about the syntax, so you could be much better off skipping the learning curves of existing parsing tools, going with a home-brewed (top-down, backtracking, unlimited lookahead) lexical analyzer and parser, and spending the bulk of your time writing the code that figures out what a parsed, but ambiguous, natural-language sentence means.
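
A hand-rolled parser of that kind can be quite small for a plain CFG; below is a minimal sketch with a made-up toy grammar and lexicon (no feature structures, memoisation or ambiguity packing, and it would loop forever on left-recursive rules).

```python
# A minimal sketch of a home-brewed top-down, backtracking parser (the
# "unlimited lookahead" comes from the backtracking). Toy grammar and
# lexicon only; no feature structures, memoisation or ambiguity packing.
GRAMMAR = {                      # nonterminal -> list of alternative expansions
    'S':  [['NP', 'VP']],
    'NP': [['DET', 'N'], ['N']],
    'VP': [['V', 'NP'], ['V']],
}
LEXICON = {'the': 'DET', 'dog': 'N', 'cat': 'N', 'dogs': 'N', 'sees': 'V'}

def expand(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` derives a prefix of tokens[pos:]."""
    if symbol in GRAMMAR:
        for alternative in GRAMMAR[symbol]:
            for children, end in expand_seq(alternative, tokens, pos):
                yield (symbol, children), end
    elif pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
        yield (symbol, tokens[pos]), pos + 1      # preterminal matches one word

def expand_seq(symbols, tokens, pos):
    """Yield (trees, next_pos) for every way the sequence of symbols can be derived."""
    if not symbols:
        yield [], pos
        return
    for first, mid in expand(symbols[0], tokens, pos):
        for rest, end in expand_seq(symbols[1:], tokens, mid):
            yield [first] + rest, end

def parse(sentence):
    tokens = sentence.split()
    return [tree for tree, end in expand('S', tokens, 0) if end == len(tokens)]

for tree in parse('the dog sees the cat'):
    print(tree)
```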
