解析器组合器可以变得高效吗? [英] Can parser combinators be made efficient?

查看:17
本文介绍了解析器组合器可以变得高效吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

大约 6 年前,我在 OCaml 中对自己的解析器组合器进行了基准测试,发现它们比当时提供的解析器生成器慢约 5 倍.我最近重新审视了这个主题,并对 Haskell 的 Parsec 与一个简单的手动优先级爬升解析器进行了基准测试 用 F# 编写,并惊讶地发现 F# 比 Haskell 快 25 倍.

Around 6 years ago, I benchmarked my own parser combinators in OCaml and found that they were ~5× slower than the parser generators on offer at the time. I recently revisited this subject and benchmarked Haskell's Parsec vs a simple hand-rolled precedence climbing parser written in F# and was surprised to find the F# to be 25× faster than the Haskell.

这是我用来从文件中读取大型数学表达式、解析和评估它的 Haskell 代码:

Here's the Haskell code I used to read a large mathematical expression from file, parse and evaluate it:

import Control.Applicative
import Text.Parsec hiding ((<|>))

expr = chainl1 term ((+) <$ char '+' <|> (-) <$ char '-')

term = chainl1 fact ((*) <$ char '*' <|> div <$ char '/')

fact = read <$> many1 digit <|> char '(' *> expr <* char ')'

eval :: String -> Int
eval = either (error . show) id . parse expr "" . filter (/= ' ')

main :: IO ()
main = do
    file <- readFile "expr"
    putStr $ show $ eval file
    putStr "
"

这是我在 F# 中独立的优先级爬升解析器:

and here's my self-contained precedence climbing parser in F#:

let rec (|Expr|) = function
  | P(f, xs) -> Expr(loop (' ', f, xs))
  | xs -> invalidArg "Expr" (sprintf "%A" xs)
and loop = function
  | ' ' as oop, f, ('+' | '-' as op)::P(g, xs)
  | (' ' | '+' | '-' as oop), f, ('*' | '/' as op)::P(g, xs) ->
      let h, xs = loop (op, g, xs)
      match op with
      | '+' -> (+) | '-' -> (-) | '*' -> (*) | '/' | _ -> (/)
      |> fun op -> loop (oop, op f h, xs)
  | _, f, xs -> f, xs
and (|P|_|) = function
  | '('::Expr(f, ')'::xs) -> Some(P(f, xs))
  | c::_ as xs when '0' <= c && c <= '9' ->
      let rec loop n = function
        | c2::xs when '0' <= c2 && c2 <= '9' -> loop (10*n + int(string c2)) xs
        | xs -> Some(P(n, xs))
      loop 0 xs
  | _ -> None

我的印象是,即使是最先进的解析器组合器也会浪费大量时间回溯.那是对的吗?如果是这样,是否可以编写生成状态机的解析器组合器以获得有竞争力的性能,还是必须使用代码生成?

My impression is that even state-of-the-art parser combinators waste a lot of time back tracking. Is that correct? If so, is it possible to write parser combinators that generate state machines to obtain competitive performance or is it necessary to use code generation?

这是我用来生成 ~2Mb 表达式以进行基准测试的 OCaml 脚本:

Here's the OCaml script I used to generate a ~2Mb expression for benchmarking:

open Printf

let rec f ff n =
  if n=0 then fprintf ff "1" else
    fprintf ff "%a+%a*(%a-%a)" f (n-1) f (n-1) f (n-1) f (n-1)

let () =
  let n = try int_of_string Sys.argv.(1) with _ -> 3 in
  fprintf stdout "%a
" f n

推荐答案

我目前正在开发 FParsec (v. 0.9) 的下一个版本,在许多情况下,相对于当前版本.

I'm currently working on the next version of FParsec (v. 0.9), which will in many situations improve performance by up to a factor of 2 relative to the current version.

[更新:FParsec 0.9 已发布,请参阅 http://www.quanttec.com/fparsec ]

[Update: FParsec 0.9 has been released, see http://www.quanttec.com/fparsec ]

我已经针对两个 FParsec 实现测试了 Jon 的 F# 解析器实现.第一个 FParsec 解析器是 djahandarie 解析器的直接翻译.第二个使用 FParsec 的可嵌入运算符优先级组件.作为输入,我使用了一个由 Jon 的 OCaml 脚本和参数 10 生成的字符串,它给了我大约 2.66MB 的输入大小.所有解析器都在发布模式下编译,并在 32 位 .NET 4 CLR 上运行.我只测量了纯解析时间,不包括启动时间或构建输入字符串(对于 FParsec 解析器)或字符列表(Jon 的解析器)所需的时间.

I've tested Jon's F# parser implementation against two FParsec implementations. The first FParsec parser is a direct translation of djahandarie's parser. The second one uses FParsec's embeddable operator precedence component. As the input I used a string generated with Jon's OCaml script with parameter 10, which gives me an input size of about 2.66MB. All parsers were compiled in release mode and were run on the 32-bit .NET 4 CLR. I only measured the pure parsing time and didn't include startup time or the time needed for constructing the input string (for the FParsec parsers) or the char list (Jon's parser).

我测量了以下数字(括号中 v. 0.9 的更新数字):

I measured the following numbers (updated numbers for v. 0.9 in parens):

  • Jon 的手动解析器:~230 毫秒
  • FParsec 解析器 #1:~270 毫秒(~235 毫秒)
  • FParsec 解析器 #2:~110 毫秒(~102 毫秒)

鉴于这些数字,我想说解析器组合器绝对可以提供具有竞争力的性能,至少对于这个特定问题,尤其是如果您考虑到 FParsec

In light of these numbers, I'd say that parser combinators can definitely offer competitive performance, at least for this particular problem, especially if you take into account that FParsec

  • 自动生成高度可读的错误消息,
  • 支持非常大的文件作为输入(具有任意回溯),并且
  • 带有一个声明性的、运行时可配置的运算符优先级解析器模块.

这是两个 FParsec 实现的代码:

Here's the code for the two FParsec implementations:

解析器 #1(djahandarie 解析器的翻译):

open FParsec

let str s = pstring s
let expr, exprRef = createParserForwardedToRef()

let fact = pint32 <|> between (str "(") (str ")") expr
let term =   chainl1 fact ((str "*" >>% (*)) <|> (str "/" >>% (/)))
do exprRef:= chainl1 term ((str "+" >>% (+)) <|> (str "-" >>% (-)))

let parse str = run expr str

解析器 #2(惯用的 FParsec 实现):

open FParsec

let opp = new OperatorPrecedenceParser<_,_,_>()
type Assoc = Associativity

let str s = pstring s
let noWS = preturn () // dummy whitespace parser

opp.AddOperator(InfixOperator("-", noWS, 1, Assoc.Left, (-)))
opp.AddOperator(InfixOperator("+", noWS, 1, Assoc.Left, (+)))
opp.AddOperator(InfixOperator("*", noWS, 2, Assoc.Left, (*)))
opp.AddOperator(InfixOperator("/", noWS, 2, Assoc.Left, (/)))

let expr = opp.ExpressionParser
let term = pint32 <|> between (str "(") (str ")") expr
opp.TermParser <- term

let parse str = run expr str

这篇关于解析器组合器可以变得高效吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆