How do I implement a lexer given that I have already implemented a basic regular expression matcher?
Problem description
I'm trying to implement a lexer for fun. I have already implemented a basic regular expression matcher (by first converting a pattern to an NFA and then to a DFA). Now I'm clueless about how to proceed.
My lexer would take a list of tokens and their corresponding regexes. What is the general algorithm used to create a lexer out of this?
I thought about ORing all the regexes together, but then I can't identify which specific token was matched. Even if I extend my regex module to return the pattern matched when a match is successful, how do I implement lookahead in the matcher?
Recommended answer
Assume you have a working regex matcher, regex_match, which returns a boolean (True if a string satisfies the regex). First, you need an ordered list of tokens (with a regex for each), tokens_regex; the order is important, as it prescribes precedence.
One possible algorithm (not necessarily the only one):
- Write a procedure next_token which takes a string and returns the first token, its value, and the remaining string (or, if it hits an illegal/ignored character: None, the offending character, and the remaining string). Note: this has to respect precedence, and should find the longest token.
- Write a procedure lex which repeatedly calls next_token until the input is consumed.
Something like this (written in Python):
tokens_regex = [(TOKEN_NAME, TOKEN_REGEX), ...]  # order describes precedence

def next_token(remaining_string):
    for t_name, t_regex in tokens_regex:  # check in order of precedence
        for i in range(len(remaining_string), 0, -1):  # check longest possibilities first (there may be a more efficient method)
            if regex_match(remaining_string[:i], t_regex):
                return t_name, remaining_string[:i], remaining_string[i:]
    return None, remaining_string[0], remaining_string[1:]  # either an ignored or illegal character

def lex(string):
    tokens_so_far = []
    remaining_string = string
    while len(remaining_string) > 0:
        t_name, t_value, remaining_string = next_token(remaining_string)
        if t_name is not None:
            tokens_so_far.append((t_name, t_value))
        # elif not regex_match(t_value, ignore_regex):
        #     check against an ignore regex; if it doesn't match, add it to an error list of illegal characters
    return tokens_so_far
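As a quick sanity check, here is the sketch above in runnable form. Python's re.fullmatch stands in for the hand-rolled NFA/DFA matcher, and the token names and patterns (NUMBER, NAME, EQUALS) are made up for this example:

```python
import re

def regex_match(string, pattern):
    # stand-in for the hand-rolled NFA/DFA matcher
    return re.fullmatch(pattern, string) is not None

tokens_regex = [            # order describes precedence
    ("NUMBER", r"[0-9]+"),
    ("NAME",   r"[a-z]+"),
    ("EQUALS", r"="),
]

def next_token(remaining_string):
    for t_name, t_regex in tokens_regex:          # in order of precedence
        for i in range(len(remaining_string), 0, -1):  # longest match first
            if regex_match(remaining_string[:i], t_regex):
                return t_name, remaining_string[:i], remaining_string[i:]
    return None, remaining_string[0], remaining_string[1:]

def lex(string):
    tokens_so_far = []
    remaining_string = string
    while remaining_string:
        t_name, t_value, remaining_string = next_token(remaining_string)
        if t_name is not None:
            tokens_so_far.append((t_name, t_value))
    return tokens_so_far

print(lex("x = 12"))  # [('NAME', 'x'), ('EQUALS', '='), ('NUMBER', '12')]
```

The spaces fall through to the None branch and are silently dropped here; in a real lexer you would route them through an ignore regex as suggested below.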
Some things you could add to improve your lexer: an ignore regex, an error list, and locations/line numbers (for those errors, or for the tokens).
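On the question of ORing all the regexes together: if each alternative carries a tag, the matcher can report which token fired. In a hand-rolled DFA you would label each accepting state with the token whose accept state produced it (breaking ties by precedence). A minimal sketch of the same idea using Python's re named groups, where lastgroup names the winning alternative (token names here are again invented for the example):

```python
import re

# one combined pattern; each token's regex becomes a named alternative
combined = re.compile(
    r"(?P<NUMBER>[0-9]+)|(?P<NAME>[a-z]+)|(?P<EQUALS>=)|(?P<SKIP>\s+)"
)

def lex_combined(string):
    tokens = []
    pos = 0
    while pos < len(string):
        m = combined.match(string, pos)   # match anchored at pos
        if m is None:
            raise ValueError("illegal character %r at position %d" % (string[pos], pos))
        if m.lastgroup != "SKIP":         # drop whitespace
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(lex_combined("x = 12"))  # [('NAME', 'x'), ('EQUALS', '='), ('NUMBER', '12')]
```

Note that alternation order in the combined pattern plays the same precedence role as the order of tokens_regex above.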
Have fun! And good luck making a parser :).