如何手工写(壳)词法分析器 [英] How to write a (shell) lexer by hand
问题描述
我工作的一个壳,小bash类似的外壳,无需脚本(如果在...)
我必须作出手工词法分析器/分析器(LL)。
I'm working on a shell, a small bash-like shell, without scripting (if while ...) I have to make the lexer/parser (LL) by hand.
让词法分析器将改变命令(的char * CMD ),以一个链表( t_list *名单)。
而LL解析器将改变链表( t_list *名单)至AST(二叉树的 t_btree *根)以<一个href=\"http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_10\">grammar
So the lexer will transform the command (char *cmd) to a linked list (t_list *list). And the LL parser will transform the linked list (t_list *list) to an AST (binary tree t_btree *root) with a grammar
所以,我知道如何使LL解析器,但我不知道如何来标记我的命令。
So, I know how to make the LL parser but I don't know how to tokenize my command.
例如: PS | grep的LS&GT;&GT;文件;让&功放;&安培; ./a.out
=> 'PS''|' 'grep'可以'ls'的'&GT;&GT;' '文件'';' ''做'和;&放大器;' ./a.out
感谢。
(我不希望使用任何发电机)
(I don't wanna use any generator)
推荐答案
(这解释了暗示的想法 Spudd86 )。
(This explains the idea hinted by Spudd86).
您需要实现一个有限状态机。主要有以下几种状态:
You need to implement a finite state machine. There are the following states:
- 常规状态
- 中的文件名
- 在
&放大器;&安培;
标记 - 在
||
标记
- General state
- Inside a file name
- Inside the
&&
token - Inside the
||
token
对于每个国家和一个输入字符,你必须决定什么是下一个状态,以及是否要输出的令牌。例如:
For each state and next input character, you have to decide what is the next state, and whether to output a token. For example:
- 当前状态:常规;人物: X =>下一个状态:里面的文件名
- 当前状态:里面的文件名;人物:空格 =>下一个状态:常规;输出令牌
- 当前状态:里面的文件名;人物:&安培; =>下一个状态:内 - &放大器;&安培; ;输出令牌
- 当前状态:内 - &放大器;&安培; ;人物:&安培; =>下一个状态:常规;输出令牌
- 当前状态:内 - &放大器;&安培; ;人物: X =>下一个状态:常规;语法错误
- ...(广告nauseum)
- Current state: General; character: x => next state: inside-file-name
- Current state: inside-file-name; character: space => next state: General; output the token
- Current state: inside-file-name; character: & => next state: inside-&&; output the token
- Current state: inside-&&; character: & => next state: General; output the token
- Current state: inside-&&; character: x => next state: General; syntax error
- ... (ad nauseum)
这是很多枯燥的工作制定出所有的规则(有趣的开始,当你必须调试产生的code),所以大多数人使用code发电机做到这一点。
It's much boring work to work out all the rules (the fun starts when you must debug the resulting code), so most people use code generators to do that.
编辑:一些code(抱歉,如果语法是混乱的;在C ++中,我通常程序)
some code (sorry if the syntax is messed-up; i usually program in C++)
enum state {
STATE_GENERAL,
STATE_IN_FILENAME,
...
};
// Many characters are treated the same (e.g. 'x' and 'y') - so use categories
enum character_category
{
CHAR_GENERAL, // can appear in filenames
CHAR_WHITESPACE = ' ',
CHAR_AMPERSAND = '&',
CHAR_PIPE = '|',
CHAR_EOF = EOF,
...
};
character_category translate(int c)
{
switch (c) {
case '&': return CHAR_AMPERSAND;
case ' ': case '\t': case '\n': return CHAR_WHITESPACE;
...
default: return CHAR_GENERAL;
}
}
void do_stuff()
{
character_category cat;
state current_state = STATE_GENERAL;
state next_state;
char token[100];
char token_length = 0;
do {
int c = getchar();
cat = translate(c);
// The following implements a switch on 2 variables
int selector = 1000 * current_state + cat;
switch (selector)
{
case 1000 * STATE_GENERAL + CHAR_GENERAL:
next_state = STATE_IN_FILENAME;
token[token_length++] = c; // append a character to a filename token
break;
case 1000 * STATE_GENERAL + CHAR_WHITESPACE:
next_state = STATE_GENERAL; // do nothing
break;
case 1000 * STATE_GENERAL + CHAR_PIPE:
next_state = STATE_IN_OR_TOKEN; // the first char in '||' or just '|'
break;
// Much repetitive code already; define a macro for the case constants?
// Have to cover all states and all character categories; good luck...
case 1000 * STATE_IN_FILENAME + EOF:
case 1000 * STATE_IN_FILENAME + CHAR_WHITESPACE:
next_state = STATE_GENERAL;
printf("Filename token: %s\n", token);
break;
default:
printf("Bug\n"); // forgot one of the cases?
}
current_state = next_state;
} while (cat != CHAR_EOF);
}
这篇关于如何手工写(壳)词法分析器的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!