是什么让 DCG 谓词变得昂贵? [英] What makes a DCG predicate expensive?

查看:58
本文介绍了是什么让 DCG 谓词变得昂贵?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在构建一个定语句语法来解析 20,000 条半自然文本.随着我的谓词数据库的大小增长(现在多达 1,200 条规则),解析一个字符串可能需要相当长的时间——特别是对于 DCG 当前无法解释的字符串,由于我还没有编码的语法.对于包含 30 个单词的字符串,当前的最坏情况是 3 分钟.我正在尝试弄清楚如何优化这一点,或者我是否应该开始研究云计算.

I'm building a Definite Clause Grammar to parse 20,000 pieces of semi-natural text. As the size of my database of predicates grows (now up to 1,200 rules), parsing a string can take quite a long time -- particularly for strings that are not currently interpretable by the DCG, due to syntax I haven't yet encoded. The current worst-case is 3 minutes for a string containing 30 words. I'm trying to figure out how I can optimize this, or if I should just start researching cloud computing.

我正在使用 SWI-Prolog,它提供了一个配置文件"目标,它提供了一些统计信息.我惊讶地发现数据库中最简单的规则占用了大部分执行时间.我的语料库包含表示数字的字符串,我想在 scalar/3 谓词中捕获这些字符串.这些占用了总执行时间的 50-60%.

I'm using SWI-Prolog, and that provides a "profile" goal, which provides some statistics. I was surprised to find that the simplest rules in my database are taking up the majority of execution time. My corpus contains strings that represent numbers, and I want to capture these in a scalar/3 predicate. These are hogging ~50-60% of total execution time.

一开始,我的 scalars.pl 中有 70 行,代表我的语料库中数字的数字和自然语言表示.像这样:

At the outset, I had 70 lines in my scalars.pl, representing the numeric and natural language representations of the numbers in my corpus. Like so:

scalar(scalar(3)) --> ["three"].
scalar(scalar(3)) --> ["3"].
scalar(scalar(4)) --> ["four"].
scalar(scalar(4)) --> ["4"].

...等等.

认为文件的长度是问题所在,我添加了一个新规则来自动解析任何数字表示:

Thinking that the length of the file was the problem, I put in a new rule that would automatically parse any numeric representations:

scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.

多亏了这一点,我已经从 70 条规则增加到 31 条规则,并有所帮助 -- 但这并不是一笔巨大的节省.还有什么可以做的吗?我的感觉可能不是,因为还有什么比列表中的单个原子更简单的呢?

Thanks to that, I've gone from 70 rules to 31, and helped a bit -- but it wasn't a huge savings. Is there anything more that can be done? My feeling is maybe not, because what could be simpler than a single atom in a list?

这些标量在整个语法中的很多地方都被调用,我认为这是问题的根源.虽然它们是简单的规则,但它们无处不在,而且不可避免.高度通用的语法不适用于我的应用程序,如果我最终得到 3,000 条或更多规则,我也不会感到惊讶.

These scalars are called in a lot of places throughout the grammar, and I assume that's the root of the issue. Though they're simple rules, they're everywhere, and unavoidably so. A highly general grammar just won't work for my application, and I wouldn't be surprised if I end up with 3,000 rules or more.

我从来没有构建过这么大的 DCG,所以我不确定在性能方面我可以期待多少.很高兴就此提出任何建议:是否有其他编码这些规则的方法?我是否应该接受某些解析需要很长时间,并弄清楚如何并行运行解析?

I've never built a DCG this large, so I'm not sure how much I can expect in terms of performance. Happy to take any kind of advice on this one: is there some other way of encoding these rules? Should I accept that some parses will take a long time, and figure out how to run parses in parallel?

先谢谢你!

我被要求提供一个可重现的示例,但要做到这一点,我必须将 SO 链接到整个项目,因为这是一个规模问题.为了完整起见,这是我正在做的事情的玩具版本.想象一下,有大量文件描述了数百个名词、数百个动词和数百个句法结构.

I was asked to provide a reproducible example, but to do that I'd have to link SO to the entire project, since this is an issue of scale. Here's a toy version of what I'm doing for the sake of completeness. Just imagine there were large files describing hundreds of nouns, hundreds of verbs, and hundreds of syntactic structures.

sent(sent(VP, NP)) --> vp(VP), np(NP).
vp(vp(V)) --> v(V).
np(np(Qty, Noun)) --> qty(Qty), n(Noun).
scalar(scalar(3)) --> ["three"].
scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.

qty(qty(Scalar)) --> scalar(Scalar).
v(v(eat)) --> ["eat"].
n(n(pie)) --> ["pie"].

推荐答案

您可能要调查的程序的一个方面是确保单个谓词快速成功并快速失败.这对于检查具有许多子句的谓词特别有用.

One aspect of your program that you might investigate is to make sure individual predicates succeed quickly and fail quickly. This is particularly useful to check for predicates that have many clauses.

例如,当 scalar(X) 在一个不是标量的标记上求值时,程序必须尝试 31 次(根据您的最后一次计数),然后才能确定 scalar//1 失败.如果您的程序结构是针对每个令牌检查 scalar(X),那么这可能会非常昂贵.

For instance, when scalar(X) is evaluated on a token that is not a scalar then the program will have to try 31 (by your last count) times before it can determine that scalar//1 fails. If the structure of your program is such that scalar(X) is checked against every token then this could be very expensive.

此外,如果 scalar(X) 碰巧发现一个标记匹配但后续目标失败,那么您的程序似乎将重试 scalar(X),直到所有 scalar//1 子句都已尝试完毕.

Further, if scalar(X) does happen to find that a token matches but a subsequent goal fails then it appears that your program will retry the scalar(X) until all of the scalar//1 clauses have been attempted.

明智地使用 cut (!) 或 if-then-else (C1->G1;C2->G2;G3) 可以提供巨大的性能改进.或者您可以构建您的谓词,以便它们依靠索引来选择适当的子句.例如:

The judicious use of cut (!) or if-then-else (C1->G1;C2->G2;G3) can provide a tremendous performance improvement. Or you can structure your predicates so that they rely on indexing to select the appropriate clause. E.g.:

scalar(scalar(N)) --> [Token], {scalar1(Token, scalar(N))}.

scalar1("3", scalar(3)) :- !.
scalar1(Y, scalar(X)) :- atom_number(Y, X).

这将 cut 和子句索引(如果编译器提供)与 scalar1/1 谓词一起使用.

This uses both cut and clause indexing (if the compiler provides it) with the scalar1/1 predicate.

您应该阅读 R. A. O'Keefe 的序言的工艺.它是 Prolog 实践方面的极好指南.

You should read R. A. O'Keefe's The Craft of Prolog. It is an excellent guide to the practical aspects of Prolog.

这篇关于是什么让 DCG 谓词变得昂贵?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆