Should I use a lexer when using a parser combinator library like Parsec?
Question

When writing a parser in a parser combinator library like Haskell's Parsec, you usually have two choices:

- Write a lexer to split your String input into tokens, then perform parsing on [Token]
- Directly write parser combinators on String

The first method often seems to make sense, given that many parsing inputs can be understood as tokens separated by whitespace.

In other places, I have seen people recommend against tokenizing (or scanning, or lexing, as some call it), with simplicity quoted as the main reason.

What are the general trade-offs between lexing and not doing it?
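As a sketch of the two choices in type-signature form (the Token constructors and type alias names here are invented for illustration, not from any particular library):

```haskell
import Text.Parsec (Parsec)

-- A hypothetical token type for the lex-first approach.
data Token = TIdent String | TNumber Double | TSymbol Char
  deriving (Eq, Show)

-- Approach 1: lex first, then run parser combinators over the tokens.
type TokenParser a = Parsec [Token] () a

-- Approach 2: run parser combinators directly over the characters.
type StringParser a = Parsec String () a
```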
The most important difference is that lexing translates your input domain.

Nice results of this are that:

- You do not have to think about whitespace anymore. In a direct (non-lexing) parser, you have to sprinkle space parsers in all places where whitespace is allowed, which is easy to forget, and it clutters your code if whitespace must separate all your tokens anyway.
- You can think about your input in a piece-by-piece manner, which is easy for humans.
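To illustrate the sprinkling, here is a hedged sketch using a made-up "let <name> = <digits>" mini-grammar (the grammar and names are invented for illustration):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Note how a whitespace parser must appear at every boundary where
-- whitespace is allowed; forgetting one silently rejects valid input.
letBinding :: Parser (String, Integer)
letBinding = do
  _     <- string "let"
  _     <- many1 space   -- mandatory whitespace after the keyword
  name  <- many1 letter
  spaces                 -- optional whitespace before '='
  _     <- char '='
  spaces                 -- optional whitespace after '='
  value <- many1 digit
  pure (name, read value)
```

For example, parse letBinding "" "let x = 42" yields Right ("x",42), while "letx = 42" fails at the mandatory whitespace.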
However, if you do perform lexing, you get the problems that:

- You cannot use common parsers on String anymore - e.g. for parsing a number with a library function parseFloat :: Parsec String s Float (which operates on a String input stream), you have to do something like takeNextToken :: TokenParser String and execute the parseFloat parser on it, inspecting the parse result (usually Either ErrorMessage a). This is messy to write and limits composability.
- You have to adjust all error messages. If your parser on tokens fails at the 20th token, where in the input string is that? You have to manually map error locations back to the input string, which is tedious (in Parsec this means adjusting all SourcePos values).
- Error reporting is generally worse. Running string "hello" *> space *> float on wrong input like "hello4" will tell you precisely that whitespace is missing after the hello, while a lexer will just claim to have found an "invalid token".
- Many things that one would expect to be atomic units, separated by the lexer, are actually quite "too hard" for a lexer to identify. Take string literals, for example - suddenly "hello world" is not the two tokens "hello and world" anymore (but only, of course, if the quotes are not escaped, like \") - while this is very natural for a parser, it means complicated rules and special cases for a lexer.
- You are stuck with it. When you are developing the language to parse, using a lexer might lead you into making early decisions that you might want to change afterwards. For example, imagine you defined a language that contains some Float token. At some point you want to introduce negative literals (-3.4 and - 3.4) - this might not be possible due to the lexer interpreting whitespace as a token separator. Using a parser-only approach, you stay more flexible and can change your language more easily. This is not really surprising, since a parser is a more complex tool that inherently encodes rules.
- You cannot re-use parsers on tokens as nicely. If you define how to parse a double out of a String, you can export it and the rest of the world can use it; they cannot run your (specialized) tokenizer first.

To summarize, I would recommend writing lexer-free parsers for most cases.
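The string-literal point above is indeed short when expressed directly as a parser. A sketch, handling only the backslash-escape case mentioned (not full escape decoding such as \n or \t):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- "hello world" stays one unit: the quotes delimit it, and an escaped
-- quote (\") does not terminate the literal.
stringLiteral :: Parser String
stringLiteral = char '"' *> many character <* char '"'
  where
    character = (char '\\' *> anyChar)  -- take any escaped character
            <|> noneOf "\"\\"           -- or any plain character
```

For example, parse stringLiteral "" "\"hello world\"" gives Right "hello world".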
In the end, a lexer is just a "dumbed-down"* parser - if you need a parser anyway, combine them into one.
* From computing theory, we know that all regular languages are also context-free languages; lexers are usually regular, parsers context-free or even context-sensitive (monadic parsers like Parsec can express context-sensitivity).
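Combining the two into one parser is the style that Parsec's own Text.Parsec.Token module is built on; the core idea fits in a few lines (a sketch):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- The "merged" style: every token-like parser eats its own trailing
-- whitespace, so no separate lexing pass is needed.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol = lexeme . string

integer :: Parser Integer
integer = lexeme (read <$> many1 digit)
```

With these, parse (symbol "let" *> integer) "" "let   42" gives Right 42, with no tokenizer in sight.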