高效的上下文无关语法解析器,最好是 Python 友好的 [英] Efficient Context-Free Grammar parser, preferably Python-friendly

查看:36
本文介绍了高效的上下文无关语法解析器,最好是 Python 友好的的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要为我的一个项目解析一小部分英语,描述为具有(1 级)特征结构的上下文无关语法(example),我需要有效地做到这一点.

I am in need of parsing a small subset of English for one of my project, described as a context-free grammar with (1-level) feature structures (example) and I need to do it efficiently .

现在我正在使用 NLTK 的解析器,它产生正确的输出但速度很慢.对于我的约 450 个相当模糊的非词典规则和 50 万个词汇条目的语法,解析简单的句子可能需要 2 到 30 秒,这取决于结果树的数量.词法条目对性能几乎没有影响.

Right now I'm using NLTK's parser which produces the right output but is very slow. For my grammar of ~450 fairly ambiguous non-lexicon rules and half a million lexical entries, parsing simple sentences can take anywhere from 2 to 30 seconds, depending it seems on the number of resulting trees. Lexical entries have little to no effect on performance.

另一个问题是在开头加载 (25MB) 语法+词典可能需要一分钟.

Another problem is that loading the (25MB) grammar+lexicon at the beginning can take up to a minute.

从我在文献中可以找到的信息来看,用于解析这种语法(Earley 或 CKY)的算法的运行时间应该与语法的大小成线性关系,与输入标记列表的大小成三次方.我使用 NLTK 的经验表明,歧义是对性能影响最大的,而不是语法的绝对大小.

From what I can find in literature, the running time of the algorithm used to parse such a grammar (Earley or CKY) should be linear to the size of the grammar and cubic to the size of the input token list. My experience with NLTK indicates that ambiguity is what hurts the performance most, not the absolute size of the grammar.

所以现在我正在寻找一个 CFG 解析器来替换 NLTK.我一直在考虑 PLY 但我不知道它是否支持 CFG 中的特征结构,这在我的情况下是必需的,我看到的例子似乎在做很多程序解析而不是仅仅指定一个语法.任何人都可以向我展示一个支持功能结构和使用声明性语法的 PLY 示例吗?

So now I'm looking for a CFG parser to replace NLTK. I've been considering PLY but I can't tell whether it supports feature structures in CFGs, which are required in my case, and the examples I've seen seem to be doing a lot of procedural parsing rather than just specifying a grammar. Can anybody show me an example of PLY both supporting feature structs and using a declarative grammar?

我也可以使用任何其他可以高效完成我需要的解析器.Python 接口更可取,但并非绝对必要.

I'm also fine with any other parser that can do what I need efficiently. A Python interface is preferable but not absolutely necessary.

推荐答案

一定要看看 Pyparsing.这是我遇到的最 Pythonic 的解析实现,从纯粹的学术角度来看,这是一个很棒的设计.

By all means take a look at Pyparsing. It's the most pythonic implementations of parsing I've come across, and it's a great design from a purely academic standpoint.

我同时使用了 ANTLRJavaCC 在当地大学教授翻译和编译器理论.它们既好又成熟,但我不会在 Python 项目中使用它们.

I used both ANTLR and JavaCC to teach translator and compiler theory at a local university. They're both good and mature, but I wouldn't use them in a Python project.

也就是说,与编程语言不同,自然语言更多地关注语义而不是语法,因此您最好跳过现有解析工具的学习曲线,使用自制的(自上而下,回溯、无限制的前瞻)词法分析器和解析器,并花费大量时间编写代码来弄清楚一个已解析但模棱两可的自然语言句子的含义.

That said, unlike programming languages, natural languages are much more about the semantics than about the syntax, so you could be much better off skipping the learning curves of existing parsing tools, going with a home-brewed (top-down, backtracking, unlimited lookahead) lexical analyzer and parser, and spending the bulk of your time writing the code that figures out what a parsed, but ambiguous, natural-language sentence means.

这篇关于高效的上下文无关语法解析器,最好是 Python 友好的的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆