Should I use a lexer when using a parser combinator library like Parsec?


Question


When writing a parser in a parser combinator library like Haskell's Parsec, you usually have 2 choices:

  • Write a lexer to split your String input into tokens, then perform parsing on [Token]
  • Directly write parser combinators on String

The first method often seems to make sense given that many parsing inputs can be understood as tokens separated by whitespace.

In other places, I have seen people recommend against tokenizing (or scanning or lexing, as some call it), with simplicity being quoted as the main reason.

What are the general trade-offs between lexing and not doing it?

Solution

The most important difference is that lexing will translate your input domain: the parser no longer works on a stream of characters, but on a stream of tokens.

A nice result of this is that

  • You do not have to think about whitespace anymore. In a direct (non-lexing) parser, you have to sprinkle space parsers in all places where whitespace is allowed, which is easy to forget, and it clutters your code if whitespace must separate all your tokens anyway (a sketch of this, and of the usual lexeme workaround, follows this list).

  • You can think about your input in a piece-by-piece manner, which is easy for humans.
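
A minimal sketch of that sprinkling in a direct String parser, and of the usual lexeme idiom that keeps whitespace handling in one place (the helpers lexeme, symbol, identifier and assignment are illustrative names, not library functions):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Wrap a parser so it also consumes any whitespace that follows it.
    lexeme :: Parser a -> Parser a
    lexeme p = p <* spaces

    symbol :: String -> Parser String
    symbol = lexeme . string

    identifier :: Parser String
    identifier = lexeme (many1 letter)

    -- "x = 42" parses with arbitrary spacing because every token-level
    -- parser eats its own trailing whitespace.
    assignment :: Parser (String, Integer)
    assignment = do
      name <- identifier
      _    <- symbol "="
      n    <- lexeme (read <$> many1 digit)
      return (name, n)

For example, parse assignment "" "x   =  42" gives Right ("x",42).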

However, if you do perform lexing, you run into the following problems:

  • You cannot use common parsers on String anymore - e.g. for parsing a number with a library function parseFloat :: Parsec String s Float (that operates on a String input stream), you have to do something like takeNextToken :: TokenParser String and execute the parseFloat parser on it, inspecting the parse result (usually Either ParseError a). This is messy to write and limits composability (a sketch of this glue code follows this list).

  • You have to adjust all error messages. If your parser on tokens fails at the 20th token, where in the input string is that? You'll have to manually map error locations back to the input string, which is tedious (in Parsec, this means adjusting all SourcePos values).

  • Error reporting is generally worse. Running string "hello" *> space *> float on wrong input like "hello4" will tell you precisely that whitespace is missing after the hello, while a lexer will just claim to have found an "invalid token" (a runnable version of this example follows this list).

  • Many things that one would expect to be atomic units and to be separated by a lexer are actually too hard for a lexer to identify. Take string literals, for example: suddenly "hello world" is not the two tokens "hello and world" anymore (but only, of course, as long as the quotes are not escaped, like \") - while this is very natural for a parser, it means complicated rules and special cases for a lexer (a small example follows this list).

  • You cannot re-use parsers on tokens as nicely. If you define how to parse a double out of a String, you can export it and the rest of the world can use it; they cannot be expected to run your (specialized) tokenizer first.

  • You are stuck with it. When you are developing the language to parse, using a lexer might lead you into making early decisions that fix things you might want to change afterwards. For example, imagine you defined a language that contains some Float token. At some point, you want to introduce negative literals (-3.4 and - 3.4) - this might not be possible because the lexer interprets whitespace as a token separator. Using a parser-only approach, you stay more flexible and can change your language more easily. This is not really surprising, since a parser is the more complex tool that inherently encodes its rules.
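
To make the first of these problems concrete, here is a hypothetical sketch of the glue code alluded to above. The Token type and the names takeNextNumber and floatTok are made up for illustration; the String-level float parser is built via Text.Parsec.Token because Parsec ships no standalone parseFloat:

    import Text.Parsec
    import qualified Text.Parsec.Token as Tok
    import Text.Parsec.Language (emptyDef)

    data Token = TNumber String | TIdent String   -- assumed token type
      deriving (Show)

    type TokParser a = Parsec [Token] () a

    -- Pull the text of the next number token out of the token stream.
    takeNextNumber :: TokParser String
    takeNextNumber = tokenPrim show (\pos _ _ -> pos) $ \t -> case t of
      TNumber s -> Just s
      _         -> Nothing

    -- Re-run a String-based float parser on that text. The inner error
    -- (if any) is reported against "<token>", not against the real
    -- input - the position-mapping problem described above.
    floatTok :: TokParser Double
    floatTok = do
      s <- takeNextNumber
      case parse (Tok.float (Tok.makeTokenParser emptyDef) <* eof) "<token>" s of
        Left err -> fail (show err)
        Right x  -> pure x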
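
The error-reporting example above, made runnable. Parsec has no bare float combinator, so float is built here from Text.Parsec.Token (and returns Double rather than Float):

    import Text.Parsec
    import Text.Parsec.String (Parser)
    import qualified Text.Parsec.Token as Tok
    import Text.Parsec.Language (emptyDef)

    float :: Parser Double
    float = Tok.float (Tok.makeTokenParser emptyDef)

    helloFloat :: Parser Double
    helloFloat = string "hello" *> space *> float

    main :: IO ()
    main = print (parse helloFloat "" "hello4")
    -- reports an error at line 1, column 6: unexpected "4", expecting space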
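
And a small illustration of the string-literal point: at the character level, a quoted literal with escaped quotes is only a few lines (the escape set here is deliberately tiny and the name stringLiteral is illustrative; Text.Parsec.Token also provides a full-featured stringLiteral), whereas a lexer needs the same machinery just to decide where the token ends:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    stringLiteral :: Parser String
    stringLiteral = char '"' *> many character <* char '"'
      where
        character = (char '\\' *> escaped) <|> noneOf "\"\\"
        escaped   = ('"'  <$ char '"')
                <|> ('\\' <$ char '\\')
                <|> ('\n' <$ char 'n')

    -- parse stringLiteral "" "\"hello \\\"world\\\"\"" == Right "hello \"world\""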

To summarize, I would recommend writing lexer-free parsers for most cases.

In the end, a lexer is just a "dumbed-down"* parser - if you need a parser anyway, combine them into one.


* From computing theory, we know that all regular languages are also context-free languages; lexers are usually regular, parsers context-free or even context-sensitive (monadic parsers like Parsec can express context-sensitivity).
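
A small illustration of the footnote: a monadic parser can let an earlier result decide how much later input to consume, for example a length-prefixed field, which is beyond what a regular lexer can express (lengthPrefixed is an illustrative name):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Parse "<n>:<n characters>", e.g. "5:hello".
    lengthPrefixed :: Parser String
    lengthPrefixed = do
      n <- read <$> many1 digit   -- the value parsed here ...
      _ <- char ':'
      count n anyChar             -- ... decides how much input is consumed here

    main :: IO ()
    main = print (parse lengthPrefixed "" "5:hello world")
    -- Right "hello"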
