解析器组合器可以有效吗? [英] Can parser combinators be made efficient?

查看:109
本文介绍了解析器组合器可以有效吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

大约6年前,我在OCaml中测试了自己的解析器组合器,发现它们比当时提供的解析器生成器慢5倍。我最近重新讨论了这个问题,并将Haskell的Parsec与一个简单的手动滚动的优先级攀登解析器用F#编写,并惊讶地发现F#比Haskell快25倍。



下面是我用来读取大型数学表达式的Haskell代码从文件中解析并评估它:

  import Control.Applicative 
import Text.Parsec hiding((< | >))

expr = chainl1 term((+)< $ char'+'< |>( - )< $ char' - ')

term = chainl1 fact((*)< $ char'*'< |> div< $ char'/')

fact = read< $> many1 digit< |> char'('*> expr< * char')'

eval :: String - > Int
eval =或者(error。show)id。解析expr。 filter(/ ='')

main :: IO()
main = do
文件< - readFileexpr
putStr $ show $ eval file
putStr\ n

这里是我在F#中的自包含的优先级爬升解析器:

  let rec(| Expr |)= function 
| P(f,xs) - > Expr(循环('',f,xs))
| xs - > invalidArgExpr(sprintf%Axs)
和loop =函数
| ''as oop,f,('+'|' - 'as op):: P(g,xs)
| ('''''''''为oop),f,('*'|'/'as op):: P(g,xs) - >
让h,xs = loop(op,g,xs)

匹配op | '+' - > (+)| ' - ' - > ( - )| '*' - > (*)| '/'| _ - > (/)
|>有趣的操作 - >循环(oop,op f h,xs)
| _,f,xs - > f,xs
和(| P | _ |)=函数
| '(':: Expr(f,')':: xs) - >一些(P(f,xs))
| c :: _ as'x''0'时<= c&& c< ='9' - >
让rec循环n =函数
| '0'时的c2 :: xs <= c2&& c2 <='9' - >循环(10 * n + int(字符串c2))xs
| xs - >一些(P(n,xs))
循环0 xs
| _ - > None

我的印象是,即使是最先进的解析器组合器也会浪费大量的时间跟踪。那是对的吗?如果是这样,是否有可能编写解析器组合器来生成状态机以获得有竞争力的性能,或者是否有必要使用代码生成?

编辑:



下面是我用来生成基准测试的〜2Mb表达式的OCaml脚本:

  open printf 

let rec f ff n =
if n = 0 then fprintf ff1else
fprintf ff%a +%a *(% a-%a)f(n-1)f(n-1)f(n-1)f(n-1)

let()=
let n = try int_of_string Sys.argv。(1)用_ - > 3 in
fprintf stdout%a\\\
fn


解决方案 div>

我目前正在研究FParsec(0.9版)的下一个版本,在许多情况下,相对于当前版本



[更新:FParsec 0.9已发布,请参阅 http://www.quanttec.com/fparsec ]



我测试过针对两个FParsec实现的Jon的F#解析器实现。第一个FParsec解析器是djahandarie解析器的直接翻译。第二个使用FParsec的嵌入式运算符优先组件。作为输入,我使用了Jon的OCaml脚本生成的字符串,其参数为10,这使得输入大小约为2.66MB。所有解析器都以发布模式编译,并在32位.NET 4 CLR上运行。我只测量纯粹的解析时间,没有包括启动时间或构建输入字符串(用于FParsec解析器)或char列表(Jon的解析器)所需的时间。

我测量了以下数字(对于0.9版parens更新的数字):
$ b $ ul

  • Jon的手动解析器:〜230ms

  • FParsec解析器#1:〜270ms(〜235ms)
  • FParsec解析器#2: 110ms(〜102ms)


    鉴于这些数字,我认为解析器组合器可以提供有竞争力的性能,至少对于这个特别的问题,特别是如果考虑到FParsec b
    $ b


    • 自动生成高度可读的错误消息,那么
    • 支持非常大的文件作为输入(带有任意回溯),并且
    • 带有声明式的,运行时配置的运算符优先级解析器模块。



    以下是两个FParsec实现的代码:

    解析器#1 ( djahandarie解析器的翻译):

      open FParsec 

    let str s = pstring s
    let expr,exprRef = createParserForwardedToRef()

    let fact = pint32< |> (str*>%(*))< |>(str/> >%(/)))
    do exprRef:= chainl1 term((str+>>%(+))< |>(str - >>%( - )))

    let parse str = run expr str

    解析器#2 (matic FParsec implementation ):

      open FParsec 

    let opp =新的OperatorPrecedenceParser< _,_,_>()
    类型Assoc = Associativity

    let str s = pstring s
    let noWS = preturn()//虚拟空白分析器

    opp.AddOperator(InfixOperator( - ,noWS,1,Assoc.Left,( - )))
    opp.AddOperator(InfixOperator(+,noWS,1,Assoc.Left ,(+)))
    opp.AddOperator(InfixOperator(*,noWS,2,Assoc.Left,(*)))
    opp.AddOperator(InfixOperator(/,noWS,2 ,Assoc.Left,(/)))

    let expr = opp.ExpressionParser
    let term = pint32< |> between(str()(str))expr
    opp.TermParser< - term

    let parse str = run expr str


    Around 6 years ago, I benchmarked my own parser combinators in OCaml and found that they were ~5× slower than the parser generators on offer at the time. I recently revisited this subject and benchmarked Haskell's Parsec vs a simple hand-rolled precedence climbing parser written in F# and was surprised to find the F# to be 25× faster than the Haskell.

    Here's the Haskell code I used to read a large mathematical expression from file, parse and evaluate it:

    import Control.Applicative
    import Text.Parsec hiding ((<|>))
    
    expr = chainl1 term ((+) <$ char '+' <|> (-) <$ char '-')
    
    term = chainl1 fact ((*) <$ char '*' <|> div <$ char '/')
    
    fact = read <$> many1 digit <|> char '(' *> expr <* char ')'
    
    eval :: String -> Int
    eval = either (error . show) id . parse expr "" . filter (/= ' ')
    
    main :: IO ()
    main = do
        file <- readFile "expr"
        putStr $ show $ eval file
        putStr "\n"
    

    and here's my self-contained precedence climbing parser in F#:

    let rec (|Expr|) = function
      | P(f, xs) -> Expr(loop (' ', f, xs))
      | xs -> invalidArg "Expr" (sprintf "%A" xs)
    and loop = function
      | ' ' as oop, f, ('+' | '-' as op)::P(g, xs)
      | (' ' | '+' | '-' as oop), f, ('*' | '/' as op)::P(g, xs) ->
          let h, xs = loop (op, g, xs)
          match op with
          | '+' -> (+) | '-' -> (-) | '*' -> (*) | '/' | _ -> (/)
          |> fun op -> loop (oop, op f h, xs)
      | _, f, xs -> f, xs
    and (|P|_|) = function
      | '('::Expr(f, ')'::xs) -> Some(P(f, xs))
      | c::_ as xs when '0' <= c && c <= '9' ->
          let rec loop n = function
            | c2::xs when '0' <= c2 && c2 <= '9' -> loop (10*n + int(string c2)) xs
            | xs -> Some(P(n, xs))
          loop 0 xs
      | _ -> None
    

    My impression is that even state-of-the-art parser combinators waste a lot of time back tracking. Is that correct? If so, is it possible to write parser combinators that generate state machines to obtain competitive performance or is it necessary to use code generation?

    EDIT:

    Here's the OCaml script I used to generate a ~2Mb expression for benchmarking:

    open Printf
    
    let rec f ff n =
      if n=0 then fprintf ff "1" else
        fprintf ff "%a+%a*(%a-%a)" f (n-1) f (n-1) f (n-1) f (n-1)
    
    let () =
      let n = try int_of_string Sys.argv.(1) with _ -> 3 in
      fprintf stdout "%a\n" f n
    

    解决方案

    I'm currently working on the next version of FParsec (v. 0.9), which will in many situations improve performance by up to a factor of 2 relative to the current version.

    [Update: FParsec 0.9 has been released, see http://www.quanttec.com/fparsec ]

    I've tested Jon's F# parser implementation against two FParsec implementations. The first FParsec parser is a direct translation of djahandarie's parser. The second one uses FParsec's embeddable operator precedence component. As the input I used a string generated with Jon's OCaml script with parameter 10, which gives me an input size of about 2.66MB. All parsers were compiled in release mode and were run on the 32-bit .NET 4 CLR. I only measured the pure parsing time and didn't include startup time or the time needed for constructing the input string (for the FParsec parsers) or the char list (Jon's parser).

    I measured the following numbers (updated numbers for v. 0.9 in parens):

    • Jon's hand-rolled parser: ~230ms
    • FParsec parser #1: ~270ms (~235ms)
    • FParsec parser #2: ~110ms (~102ms)

    In light of these numbers, I'd say that parser combinators can definitely offer competitive performance, at least for this particular problem, especially if you take into account that FParsec

    • automatically generates highly readable error messages,
    • supports very large files as input (with arbitrary backtracking), and
    • comes with a declarative, runtime-configurable operator-precedence parser module.

    Here's the code for the two FParsec implementations:

    Parser #1 (Translation of djahandarie's parser):

    open FParsec
    
    let str s = pstring s
    let expr, exprRef = createParserForwardedToRef()
    
    let fact = pint32 <|> between (str "(") (str ")") expr
    let term =   chainl1 fact ((str "*" >>% (*)) <|> (str "/" >>% (/)))
    do exprRef:= chainl1 term ((str "+" >>% (+)) <|> (str "-" >>% (-)))
    
    let parse str = run expr str
    

    Parser #2 (Idiomatic FParsec implementation):

    open FParsec
    
    let opp = new OperatorPrecedenceParser<_,_,_>()
    type Assoc = Associativity
    
    let str s = pstring s
    let noWS = preturn () // dummy whitespace parser
    
    opp.AddOperator(InfixOperator("-", noWS, 1, Assoc.Left, (-)))
    opp.AddOperator(InfixOperator("+", noWS, 1, Assoc.Left, (+)))
    opp.AddOperator(InfixOperator("*", noWS, 2, Assoc.Left, (*)))
    opp.AddOperator(InfixOperator("/", noWS, 2, Assoc.Left, (/)))
    
    let expr = opp.ExpressionParser
    let term = pint32 <|> between (str "(") (str ")") expr
    opp.TermParser <- term
    
    let parse str = run expr str
    

    这篇关于解析器组合器可以有效吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

  • 查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆