Prolog DCG 语法规则中的堆栈溢出:如何高效或懒惰地处理大型列表 [英] Stack overflow in Prolog DCG grammar rule: how to handle large lists efficiently or lazily

查看:8
本文介绍了Prolog DCG 语法规则中的堆栈溢出:如何高效或懒惰地处理大型列表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在解析一个由一系列行组成的相当简单的文件格式,每行都有一些空格分隔的字段,如下所示:

l 0x9823 1s 0x1111 3l 0x1111 12⋮

我正在使用 SWI-Prolog.这是我目前拥有的 DCG:

:- 咨询(库(pure_input)).load_trace(文件名,跟踪):-phrase_from_file(trace_file_phrase(痕迹),文件名).trace_file_phrase([]) -->[].trace_file_phrase([T|Ts]) -->trace_phrase(T),trace_file_phrase(Ts).trace_phrase(访问(类型,地址,SinceLast))-->access_type(类型),空间,地址(地址),空间,nat(SinceLast),换行符.访问类型(加载)-->升".访问类型(存储)-->s".地址(数字)-->0x",十六进制数(数字).十六进制数字(N)-->数字(N).十六进制数(10)-->一个".十六进制数(11)-->乙".十六进制数字(12)-->C".十六进制数字(13)-->d".十六进制数(14)-->e".十六进制数字(15)-->F".十六进制数(N)-->十六进制数(D),十六进制数(D,N).十六进制数(N,N)-->[].十六进制数(A,N)-->hexdigit(D), { A1 是 A*16 + D }, hexnum(A1, N).换行 -->
".空间-->".%% 以下两部作品由 Lars Mans 提供%% https://stackoverflow.com/questions/3279822/parsing-numbers-with-multiple-digits-in-prolog数字(0)-->0".数字(1)-->1".数字(2)-->2".数字(3)-->3".数字(4)-->4".数字(5)-->5".数字(6)-->6".数字(7)-->7".数字(8)-->8".数字(9)-->9".nat(N) -->数字(D),自然(D,N).nat(N,N) -->[].nat(A,N) -->数字(D),{ A1 是 A*10 + D },nat(A1,N).

正如评论中提到的,我从 Parsing numbers withProlog中的多个数字.

我遇到的问题是其中一些文件很大,例如大约 5-10 MB.SWI-Prolog 中的默认堆栈对此不够用,并且解析这些文件需要大量时间,大约 5-15 秒.我对这种情况有几个问题:

  1. 这段代码的效率问题在哪里?我认为它在 trace_file_phrase//1nat//1 中,但这些只是预感.
  2. 如果问题是列表,有没有比这更好的方法来处理带有 DCG 的列表?
  3. 一般来说,如何诊断和处理此类 DCG 的性能问题?

解决方案

一般来说,您会在名称 库(pio).另外干净利落的使用方法是:

:- use_module(library(pio)).

你的例子太复杂了,所以我只考虑一个稍微简单的例子,一个换行符分隔的数字列表:

<上一页>nats([]) --> [].nats([N|Ns]) --> nat(N),换行,nat(Ns).

那么,如何有效地测试呢?(这是您的问题 3)library(pio) 的基本点是您可以使用常规 DCG 进行文件处理.但是对于小型测试,您仍然可以使用简单的 phrase/2.所以我这样做:

<上一页>?- 短语(nats(Ns),"1 ").Ns = [1];错误的.

你看到 ; 提示了吗?这意味着:Prolog 无法决定是否可以计算进一步的答案——因此它留下了一个或多个选择点.只有一个数字你可以想象事情会如何堆积.

让我们深入挖掘:

<上一页>?- 短语(数字(D),1").D = 1;错误的.

又是邪恶的;!为了完成这项工作,一切都必须确定.有以下三种方法:

使用削减(并失去你的灵魂)

祝你好运——最好的似乎是在重复元素之后:

<上一页>trace_file_phrase([]) --> [].trace_file_phrase([T|Ts]) -->跟踪短语(T),!,%丑陋,但是...跟踪文件短语(Ts).

(这应该回答问题 1)

但是,等一下!这个 ! 有什么不好?只要确切地有一个trace_phrase//1的答案,事情就是完美的.只是,如果有更多答案(或实际解决方案),削减可能会删除宝贵的答案.你怎么知道,如果有更多的解决方案?好吧,你没有.而且你不会看到它们,因为它们已经被切掉了.

call_semidet/1

这是一种确保不会发生这种情况的方法.这仅适用于可以调用两次而没有任何影响的无副作用目标:

<上一页>call_semidet(目标):-( call_nth(目标, 2)-> 抛出(错误(mode_error(semidet,Goal),_));一次(目标)).

这使用 call_nth/2,在另一篇文章中定义.(作为一种优化,当没有选择点打开时,实现可以避免调用 Goal 两次......)只是为了弄清楚它是如何工作的:

<上一页>?- 短语(nat(N),1234").N = 1234 ;错误的.?- call_semidet(短语(nat(N),"1234")).N = 1234.?- call_semidet((X=1;X=2)).错误:未知错误术语:mode_error(semidet, (2=1;2=2))

所以它使你的小语法有效地确定!因此,无需重新制定任何内容!

现在缺少的是把它整合到语法中.您可以在非常低的级别执行此操作,或者使用 <代码>库(lambda).

<上一页>短语_semidet(NT) -->呼叫(S0^S^call_semidet(短语(NT,S0,S))).

请注意,在这种非常特殊的情况下,我们不使用 进行重命名.

<上一页>trace_file_phrase([]) --> [].trace_file_phrase([T|Ts]) -->短语_semidet(trace_phrase(T)),跟踪文件短语(Ts).

利用索引

最后,一个非常费力但干净的方法是重写所有内容以更好地从索引中获利(并且可能有助于总体上改进索引......)但这是一条漫长的道路.刚开始:

<上一页>数字(D)-> [C],{c_digit(C,D)}.c_digit(0'0,0).c_digit(0'1,1).c_digit(0'2,2).c_digit(0'3,3).c_digit(0'4,4).c_digit(0'5,5).c_digit(0'6,6).c_digit(0'7,7).c_digit(0'8,8).c_digit(0'9,9).

现在给你:

<上一页>?- 短语(数字(D),1").D = 1.

但是您还有另一个不确定性来源,这与您定义语法的方式有关.在 nat//2 你可以看到:

<上一页>nat(N,N) --> [].nat(A,N) --> 数字(D), ... .

第一条规则始终适用,即 "1234 " 将被解析为 "1" "12" "123" "1234" 只有下面的 newline//0 意识到最后一个就足够了 —然后坚持下去.

你可以为此重写一些东西,但是,代码不再是你喜欢的纯粹的小规范,不是吗?好吧,也许将来情况会有所改善.

例如SWI 中的索引比以前好得多,也许这里的事情也在发展....

library(pio) 的目的是启动这个过程.将此与 Haskell 进行比较 -在效率方面,我们离 interact 还很远!但没有固有成本:

<上一页>... --> [] |[_],...?-phrase_from_file((...,"searchstring",...),fichier).

和 grep — 一样高效空间方面.也就是说,它在恒定空间中运行.所以希望将来有更多的代码运行得更好.

顺便说一句,library(pio) 确实已经在效率方面产生了影响:GC 阶段得到了显着改善,与 Wadler 的修复一些空间泄漏的方式非常相似–四分之一世纪前的纸.事情在发展......

I'm parsing a fairly simple file format consisting of a series of lines, each line having some space separated fields, that looks like this:

l 0x9823 1
s 0x1111 3
l 0x1111 12
⋮

I'm using SWI-Prolog. This is the DCG I have so far:

:- consult(library(pure_input)).

load_trace(Filename, Traces) :-
    phrase_from_file(trace_file_phrase(Traces), Filename).

trace_file_phrase([]) --> [].
trace_file_phrase([T|Ts]) --> trace_phrase(T), trace_file_phrase(Ts).

trace_phrase(access(Type, Address, SinceLast)) -->
    access_type(Type), space,
    address(Address),  space,
    nat(SinceLast),    newline.

access_type(load)  --> "l".
access_type(store) --> "s".

address(Number) --> "0x", hexnum(Number).

hexdigit(N)  --> digit(N).
hexdigit(10) --> "a". hexdigit(11) --> "b". hexdigit(12) --> "c".
hexdigit(13) --> "d". hexdigit(14) --> "e". hexdigit(15) --> "f".
hexnum(N) --> hexdigit(D), hexnum(D, N).
hexnum(N, N) --> [].
hexnum(A, N) --> hexdigit(D), { A1 is A*16 + D }, hexnum(A1, N).

newline --> "
".
space --> " ".

%% the following two productions are courtesy of Lars Mans at
%% https://stackoverflow.com/questions/3279822/parsing-numbers-with-multiple-digits-in-prolog
digit(0) --> "0". digit(1) --> "1". digit(2) --> "2".
digit(3) --> "3". digit(4) --> "4". digit(5) --> "5".
digit(6) --> "6". digit(7) --> "7". digit(8) --> "8".
digit(9) --> "9".

nat(N)   --> digit(D), nat(D,N).
nat(N,N) --> [].
nat(A,N) --> digit(D), { A1 is A*10 + D }, nat(A1, N).

As mentioned in the comment, I cribbed the number handling from Parsing numbers with multiple digits in Prolog.

The problem I'm running into is some of these files are large, like, on the order of 5-10 MB. The default stack in SWI-Prolog is insufficient for this, and parsing these files is taking substantial time, on the order of 5-15 seconds. I have several questions about this situation:

  1. Where is the efficiency problem in this code? I think it's either in trace_file_phrase//1 or nat//1 but these are just hunches.
  2. If the problem is lists, is there a better way to handle lists with DCGs than this?
  3. How does one, in general, diagnose and treat performance problems with DCGs such as this?

解决方案

As a general remark you will find more on SO about it under the name library(pio). Also the way to use it cleanly is rather:

:- use_module(library(pio)).

Your example is way too complex, so I will only consider a slightly simpler case, a newline separated list of numbers:

nats([]) --> [].
nats([N|Ns]) --> nat(N), newline, nats(Ns).

So, how can you test this effectively? (That's your Question 3) The basic point of library(pio) is that you can use regular DCGs for file processing. But for testing in the small you can still use the simple phrase/2. So I do:

?- phrase(nats(Ns),"1
").
Ns = [1] ;
false.

Did you see the ; prompt? That means: Prolog was not able to decide whether or not further answers might be computed - so it leaves one or more choice-points open. And that only for a single digit You can imagine how things will pile up.

Let's dig deeper:

?- phrase(digit(D),"1").
D = 1 ;
false.

Again the evil ;! In order to make this work, everything would have to be determinate. There are three ways to this:

Use cuts (and lose your soul)

I wish you luck - the best seems to be just after the repeating element:

trace_file_phrase([]) --> [].
trace_file_phrase([T|Ts]) -->
   trace_phrase(T),
   !, % ugly, but...
   trace_file_phrase(Ts).

(This should answer Question 1)

But, hold a minute! What is so bad about this !? As long, as there is exactly one answer to trace_phrase//1 things are perfect. It is only, if there are more answers (or actually solutions), that the cut might remove precious answers. How do you know, if there are more solutions? Well, you don't. And you won't see them as they have been cut away already.

call_semidet/1

Here is a way to ensure that this does not happen. This works only for side-effect free goals that can be called twice without any effect:

call_semidet(Goal) :-
   (  call_nth(Goal, 2)
   -> throw(error(mode_error(semidet,Goal),_))
   ;  once(Goal)
   ).

This uses call_nth/2, as defined in another post. (As an optimization, the implementation could avoid calling Goal twice when there is no choice-point open...) Just to make clear, how it works:

?- phrase(nat(N),"1234").
N = 1234 ;
false.

?- call_semidet(phrase(nat(N),"1234")).
N = 1234.

?- call_semidet((X=1;X=2)).
ERROR: Unknown error term: mode_error(semidet, (2=1;2=2))

So it makes your little grammar effectively determinate! There is thus no need to reformulate anything!

What is lacking now is some integration of this into the grammar. You can do this very low-level, or rather cleanly using library(lambda).

phrase_semidet(NT) -->
   call(S0^S^call_semidet(phrase(NT,S0,S))).

Note that in this very particular case we do not use the for renaming.

trace_file_phrase([]) --> [].
trace_file_phrase([T|Ts]) -->
   phrase_semidet(trace_phrase(T)),
   trace_file_phrase(Ts).

Exploit indexing

Finally, a very laborious but clean way would be to rewrite everything to profit better from indexing (and maybe help to improve indexing in general...) But this is a long road. Just to start with:

digit(D) --> [C],
   {c_digit(C,D)}.

c_digit(0'0,0).
c_digit(0'1,1).
c_digit(0'2,2).
c_digit(0'3,3).
c_digit(0'4,4).
c_digit(0'5,5).
c_digit(0'6,6).
c_digit(0'7,7).
c_digit(0'8,8).
c_digit(0'9,9).

This gives you now:

?- phrase(digit(D),"1").
D = 1.

But you have another source of nondeterminism which is rather due to the way you define the grammar. In nat//2 you see it:

nat(N,N) --> [].
nat(A,N) --> digit(D), ... .

The first rule always applies, that is, "1234 " will be parsed as "1" "12" "123" "1234" only the following newline//0 realizes that it would suffice go for the last — and then stick to it.

You can rewrite things for that, but then, the code is no longer the pure little spec you liked, isn't it? Well, maybe things might improve in the future.

E.g. indexing is much better in SWI than it used to be, maybe also things evolve here too....

The intention of library(pio) was to get this process started. Compare this to Haskell — we are far away from interact efficiency-wise! But there is no inherent cost:

... --> [] | [_], ... .

?- phrase_from_file((...,"searchstring",...),fichier).

is just as efficient as grep — space-wise. That is, it runs in constant space. So hopefully more code will run better in the future.

Edit: BTW, library(pio) did already have an impact efficiency-wise: GC phases were improved significantly, very much in the same manner as Wadler's Fixing some space leak – paper a quarter century ago. Things evolve ...

这篇关于Prolog DCG 语法规则中的堆栈溢出:如何高效或懒惰地处理大型列表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆