解析Sexp的一种优雅方法 [英] An elegant way to parse sexp

查看:79
本文介绍了解析Sexp的一种优雅方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

sexp 类似于:type sexp = Atom of string | List of sexp list,例如"((a b) ((c d) e) f)".

我已经编写了一个解析器来将sexp字符串解析为以下类型:

let of_string s =
  let len = String.length s in
  let empty_buf () = Buffer.create 16 in
  let rec parse_atom buf i =
    if i >= len then failwith "cannot parse"
    else 
      match s.[i] with 
      | '(' -> failwith "cannot parse"
      | ')' -> Atom (Buffer.contents buf), i-1
      | ' ' -> Atom (Buffer.contents buf), i
      | c when i = len-1 -> (Buffer.add_char buf c; Atom (Buffer.contents buf), i)
      | c -> (Buffer.add_char buf c; parse_atom buf (i+1))
  and parse_list acc i =
    if i >= len || (i = len-1 && s.[i] <> ')') then failwith "cannot parse"
    else 
      match s.[i] with
      | ')' -> List (List.rev acc), i
      | '(' -> 
        let list, j = parse_list [] (i+1) in
        parse_list (list::acc) (j+1)
      | c -> 
        let atom, j = parse_atom (empty_buf()) i in
        parse_list (atom::acc) (j+1)
  in 
  if s.[0] <> '(' then
    let atom, j = parse_atom (empty_buf()) 0 in
    if j = len-1 then atom
    else failwith "cannot parse"
  else 
    let list, j = parse_list [] 1 in
    if j = len-1 then list
    else failwith "cannot parse"

但是我认为它太冗长和丑陋.

有人可以用一种优雅的方式帮助我编写这样的解析器吗?

实际上,我在编写解析器代码时总是遇到问题,而我只能写出这么丑陋的代码.

这种解析有什么窍门吗?如何有效处理暗示递归解析的符号(例如())?

解决方案

您可以使用lexer + parser学科将词法语法的细节(主要是跳过空格)与实际语法结构分开.对于这样一个简单的语法来说,这似乎有些矫kill过正,但实际上,只要您解析的数据极有可能出现错误,它就会变得更好:您确实想要错误位置(而不是自己实现).

一种简单且可提供短解析器的技术是使用流解析器( Genlex 模块免费获得词法分析器.

如果您想真正地手动完成它,如上面的示例中所示,我建议您使用一个不错的解析器结构.具有相互递归的解析器,每个语法对应一个语法解析器,并具有以下接口:

  • 解析器将开始解析的索引作为输入
  • 它们返回一对解析后的值和该值的第一个索引 not 部分
  • 没什么

您的代码不遵守此结构.例如,如果原子的解析器看到(,则将失败.那不是他的角色和责任:它应该简单地认为此字符不是原子的一部分,并返回原子已解析的位置,表明该位置不再在原子中.

这里是这种风格的代码示例,供您语法参考.我已将解析器与累加器分成三部分(start_fooparse_foofinish_foo)以分解多个起点或返回点,但这只是实现细节.

我只是出于娱乐目的而使用了4.02的新功能,与异常匹配,而不是显式测试字符串的结尾.恢复花哨的东西当然是微不足道的.

最后,如果有效的表达式在输入的结尾之前结束,则当前的解析器不会失败,它只会在侧面返回输入的结尾.这对测试很有帮助,但是无论您打算做什么,您都将在生产"中以不同的方式进行操作.

let of_string str =
  let rec parse i =
    match str.[i] with
      | exception _ -> failwith "unfinished input"
      | ')' -> failwith "extraneous ')'"
      | ' ' -> parse (i+1)
      | '(' -> start_list (i+1)
      | _ -> start_atom i
  and start_list i = parse_list [] i
  and parse_list acc i =
    match str.[i] with
      | exception _ -> failwith "unfinished list"
      | ')' -> finish_list acc (i+1)
      | ' ' -> parse_list acc (i+1)
      | _ ->
        let elem, j = parse i in
        parse_list (elem :: acc) j
  and finish_list acc i =
    List (List.rev acc), i
  and start_atom i = parse_atom (Buffer.create 3) i
  and parse_atom acc i =
    match str.[i] with
      | exception _ -> finish_atom acc i
      | ')' | ' ' -> finish_atom acc i
      | _ -> parse_atom (Buffer.add_char acc str.[i]; acc) (i + 1)
  and finish_atom acc i =
    Atom (Buffer.contents acc), i
  in
  let result, rest = parse 0 in
  result, String.sub str rest (String.length str - rest)

请注意,在解析有效表达式(您必须读取至少一个原子或列表)或解析列表(您必须遇到右括号)时到达输入结尾是错误的,但是在原子的末尾有效.

此解析器不返回位置信息.所有现实世界中的解析器都应该这样做,这是使用词法分析器/解析器方法(或您首选的monadic解析器库)而不是手工完成的足够理由.在这里返回位置信息并不是很困难,但是,只需将i参数复制到当前已解析字符的索引中,另一方面复制到用于当前AST节点的第一个索引中即可.只要产生结果,位置就是对(first indexlast valid index).

sexp is like this: type sexp = Atom of string | List of sexp list, e.g., "((a b) ((c d) e) f)".

I have written a parser to parse a sexp string to the type:

let of_string s =
  let len = String.length s in
  let empty_buf () = Buffer.create 16 in
  let rec parse_atom buf i =
    if i >= len then failwith "cannot parse"
    else 
      match s.[i] with 
      | '(' -> failwith "cannot parse"
      | ')' -> Atom (Buffer.contents buf), i-1
      | ' ' -> Atom (Buffer.contents buf), i
      | c when i = len-1 -> (Buffer.add_char buf c; Atom (Buffer.contents buf), i)
      | c -> (Buffer.add_char buf c; parse_atom buf (i+1))
  and parse_list acc i =
    if i >= len || (i = len-1 && s.[i] <> ')') then failwith "cannot parse"
    else 
      match s.[i] with
      | ')' -> List (List.rev acc), i
      | '(' -> 
        let list, j = parse_list [] (i+1) in
        parse_list (list::acc) (j+1)
      | c -> 
        let atom, j = parse_atom (empty_buf()) i in
        parse_list (atom::acc) (j+1)
  in 
  if s.[0] <> '(' then
    let atom, j = parse_atom (empty_buf()) 0 in
    if j = len-1 then atom
    else failwith "cannot parse"
  else 
    let list, j = parse_list [] 1 in
    if j = len-1 then list
    else failwith "cannot parse"

But I think it is too verbose and ugly.

Can someone help me with an elegant way to write such a parser?

Actually, I always have problems in writing code of parser and what I could do only is write such a ugly one.

Any tricks for this kind of parsing? How to effectively deal with symbols, such as (, ), that implies recursive parsing?

解决方案

You can use a lexer+parser discipline to separate the details of lexical syntax (skipping spaces, mostly) from the actual grammar structure. That may seem overkill for such a simple grammar, but it's actually better as soon as the data you parse has the slightest chance of being wrong: you really want error location (and not to implement them yourself).

A technique that is easy and gives short parsers is to use stream parsers (using a Camlp4 extension for them described in the Developping Applications with Objective Caml book); you may even get a lexer for free by using the Genlex module.

If you want to do really do it manually, as in your example above, here is my recommendation to have a nice parser structure. Have mutually recursive parsers, one for each category of your syntax, with the following interface:

  • parsers take as input the index at which to start parsing
  • they return a pair of the parsed value and the first index not part of the value
  • nothing more

Your code does not respect this structure. For example, you parser for atoms will fail if it sees a (. That is not his role and responsibility: it should simply consider that this character is not part of the atom, and return the atom-parsed-so-far, indicating that this position is not in the atom anymore.

Here is a code example in this style for you grammar. I have split the parsers with accumulators in triples (start_foo, parse_foo and finish_foo) to factorize multiple start or return points, but that is only an implementation detail.

I have used a new feature of 4.02 just for fun, match with exception, instead of explicitly testing for the end of the string. It is of course trivial to revert to something less fancy.

Finally, the current parser does not fail if the valid expression ends before the end of the input, it only returns the end of the input on the side. That's helpful for testing but you would do it differently in "production", whatever that means.

let of_string str =
  let rec parse i =
    match str.[i] with
      | exception _ -> failwith "unfinished input"
      | ')' -> failwith "extraneous ')'"
      | ' ' -> parse (i+1)
      | '(' -> start_list (i+1)
      | _ -> start_atom i
  and start_list i = parse_list [] i
  and parse_list acc i =
    match str.[i] with
      | exception _ -> failwith "unfinished list"
      | ')' -> finish_list acc (i+1)
      | ' ' -> parse_list acc (i+1)
      | _ ->
        let elem, j = parse i in
        parse_list (elem :: acc) j
  and finish_list acc i =
    List (List.rev acc), i
  and start_atom i = parse_atom (Buffer.create 3) i
  and parse_atom acc i =
    match str.[i] with
      | exception _ -> finish_atom acc i
      | ')' | ' ' -> finish_atom acc i
      | _ -> parse_atom (Buffer.add_char acc str.[i]; acc) (i + 1)
  and finish_atom acc i =
    Atom (Buffer.contents acc), i
  in
  let result, rest = parse 0 in
  result, String.sub str rest (String.length str - rest)

Note that it is an error to reach the end of input when parsing a valid expression (you must have read at least one atom or list) or when parsing a list (you must have encountered the closing parenthesis), yet it is valid at the end of an atom.

This parser does not return location information. All real-world parsers should do so, and this is enough of a reason to use a lexer/parser approach (or your preferred monadic parser library) instead of doing it by hand. Returning location information here is not terribly difficult, though, just duplicate the i parameter into the index of the currently parsed character, on one hand, and the first index used for the current AST node, on the other; whenever you produce a result, the location is the pair (first index, last valid index).

这篇关于解析Sexp的一种优雅方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆