优化弹性字符串文字解析 [英] Optimizing flex string literal parsing

查看:93
本文介绍了优化弹性字符串文字解析的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我开始为我的编程语言编写词法分析器.

I am starting writing a lexical analyzer for my programming language.

此语言的字符串文字以"开头,并在遇到未转义的"时结束.除转义序列(通常的\n s,\t s,\" s等)外,其他所有内容(包括换行符)都将保留,并使用一种使用字符的ASCII码(例如\097).

String literals in this language start with a " and end when an unescaped " is encountered. Everything inside (including newlines) is preserved, except escape sequences (the usual \ns, \ts, \"s etc plus a way of escaping a character by using its ASCII code, e.g. \097 or \97).

这是我到目前为止编写的代码:

This is the code I have written so far:

%{
#include <iostream>
#define YY_DECL extern "C" int yylex()

std::string buffstr;
%}
%x SSTATE
%%

\"                   {
                         buffstr.clear();
                         BEGIN(SSTATE);
                     }
<SSTATE>\\[0-9]{1,3} {
                         unsigned code = atoi(yytext + 1);
                         if (code > 255) {
                             std::cerr << "SyntaxError: decimal escape sequence larger than 255 (" << code << ')' << std::endl;
                             exit(1);
                         }
                         buffstr += code;
                     }

<SSTATE>\\a          buffstr += '\a';
<SSTATE>\\b          buffstr += '\b';
<SSTATE>\\f          buffstr += '\f';
<SSTATE>\n           buffstr += '\n';
<SSTATE>\r           buffstr += '\r';
<SSTATE>\t           buffstr += '\t';
<SSTATE>\v           buffstr += '\v';
<SSTATE>\\\\         buffstr += '\\';
<SSTATE>\\\"         buffstr += '\"';
<SSTATE>\\.          {
                         std::cerr << "SyntaxError: invalid escape sequence (" << yytext << ')' << std::endl;
                         exit(1);
                     }
<SSTATE>\"           {
                         std::cout << "Found a string: " << buffstr << std::endl;
                         BEGIN(INITIAL);
                     }
<SSTATE>.            buffstr += yytext[0];

.                    ;

%%

int main(int argc, char** argv) {
    yylex();
}

它可以完美运行,但是如您所见,它并没有特别优化.

It works perfectly, but as you can see it's not particularly optimized.

对于要解析的字符串文字中的每个字符,都将一个字符附加到std :: string一次,这是不理想的.

It's appending a character to a std::string once for each character in the string literal being parsed, which is not ideal.

我想知道是否有更好的方法,例如存储指针并增加长度,然后使用std::string(const char* ptr, size_t lenght)构建字符串的示例.

I wonder if there's a bettere way of doing it, for an example storing a pointer and increasing a lenght and then building the string with std::string(const char* ptr, size_t lenght).

有一个吗?那会是什么?

Is there one? What would be it?

推荐答案

可能是这样的情况,所提供的代码对于所有实际目的而言都足够快,并且您不必担心对其进行优化,直到您真正看到它成为瓶颈为止. .词法扫描,即使效率低下,也很少对编译时间有重要贡献.

It's probably the case that the code provided is sufficiently fast for all practical purposes, and that you should not worry about optimizing it until you actually observe it being a bottleneck. Lexical scans, even inefficient ones, are rarely an important contribution to compile times.

但是,一些优化是直接的.

However, some optimizations are straight-forward.

最简单的方法是观察大多数字符串不包含转义序列.因此,应用通常的优化技术来寻找低洼的果实,我们从一个字符串中处理没有转义序列的字符串开始,甚至没有经过单独的词法状态. [注1]

The easiest one is to observe that most strings do not contain escape sequences. So applying the usual optimization technique of going for the low-lying fruit, we start by handling strings without escape sequences in one single pattern, without even passing through the separate lexical state. [Note 1]

\"[^"\\]*\"   { yylval.str = new std::string(yytext + 1, yyleng - 2); 
                return T_STRING;
              }

(F)lex提供了yyleng,它是找到的令牌的长度,因此,从没有真正的理由使用strlen重新计算长度.在这种情况下,我们不需要在字符串中用双引号引起来,因此我们选择从第二个字符开始的yyleng - 2个字符.

(F)lex provides yyleng which is the length of the token it found, so there is never really any reason to recompute the length with strlen. In this case, we don't want the surrounding double quotes in the string, so we select yyleng - 2 characters starting at the second character.

当然,我们需要处理转义码;我们可以使用类似于您的开始条件.仅当我们在字符串文字中找到转义字符时,才输入此开始条件. [注2]为了解决这种情况,我们依靠(f)lex实施的最大munch 规则,该规则是匹配时间最长的模式击败了其他碰巧匹配的模式.相同的输入点. [注3]由于我们已经匹配了以"开头且在结束" 之前不包含反斜杠的所有令牌,因此我们可以添加非常相似的模式而无需结束语只会在第一个规则不匹配的情况下匹配,因为与结束语的匹配要长一个字符.

Of course, we need to handle the escape codes; we can use a start condition similar to yours to do so. We only enter this start condition when we find an escape character inside the string literal. [Note 2] To catch this case, we rely on the maximal munch rule implemented by (f)lex, which is that the pattern with the longest match beats out any other patterns which happen to match at the same input point. [Note 3] Since we've already matched any token which starts with a " and does not include a backslash before the closing ", we can add a very similar pattern without the closing quote which will only match in case the first rule doesn't, because the match with the closing quote is one character longer.

\"[^"\\]*     { yylval.str = new std::string(yytext + 1, yyleng - 1);
                BEGIN(S_STRING);
                /* No return, so the scanner will continue in the new state */
              }

S_STRING状态下,我们仍然可以匹配不包含反斜杠的序列(不仅仅是单个字符),从而显着减少了动作执行和字符串追加的数量:

In the S_STRING state, we can still match sequences (not just single characters) which don't contain a backslash, thereby reducing significantly the number of action executions and string appends:

(开始条件下的括号模式列表是弹性扩展.)

(Braced pattern lists in a start condition are a flex extension.)

<S_STRING>{
  [^"\\]+       { yylval.str->append(yytext, yyleng); }
  \\n           { (*yylval.str) += '\n'; }
   /* Etc. Handle other escape sequences similarly */
  \\.           { (*yylval.str) += yytext[1]; }
  \\\n          { /* A backslash at the end of the line. Do nothing */ }
  \"            { BEGIN(INITIAL); return T_STRING; }
     /* See below */
}

当我们最终找到与最后一个模式匹配的未转义的双引号时,我们首先重置词法状态,然后返回已完全构造的字符串.

When we eventually find an unescaped double-quote, which will match the last pattern, we first reset the lexical state, and then return the string which has been completely constructed.

模式\\\n实际上与行尾的反斜杠匹配.完全忽略此反斜杠和换行符是很常见的,以便允许长字符串在多个源代码行上继续.如果您不想提供此功能,只需将\.模式更改为\(.|\n).

The pattern \\\n actually matches a backslash at the very end of the line. It's common to completely ignore this backslash and the newline, in order to allow long strings to be continued over several source lines. If you don't want to provide this feature, just change the \. pattern to \(.|\n).

如果我们找不到未转义的双引号怎么办?也就是说,如果偶然省略了双引号怎么办?在这种情况下,我们将以S_STRING起始条件结束,因为字符串没有以引号终止,因此后备模式将匹配.在S_STRING模式中,我们需要添加另外两种可能性:

And what if we don't find an unescaped double-quote? That is, what if the closing double quote was accidentally omitted? We will end up in the S_STRING start condition in this case, since the string was not terminated by a quote, and so the fallback pattern will match. In the S_STRING patterns, we need to add two more possibilities:

<S_STRING>{
    // ... As above
  <<EOF>>      |
  \\           { /* Signal a lexical error */ }
}

这些规则中的第一个捕获了简单的未终止的字符串错误.第二个则捕获了反斜杠后面没有合法字符的情况,鉴于其他规则,只有当反斜杠是程序中最后一个具有未终止字符串的字符时,该规则才会发生.尽管那是不可能的,但是它有可能发生,所以我们应该抓住它.

The first of these rules catches the simple unterminated string error. The second one catches the case in which a backslash was not followed by a legitimate character, which given the other rules can only happen if a backslash is the very last character in a program with an unterminated string. Unlikely though that is, it can happen so we should catch it.

进一步的优化是相对简单的,尽管我不建议这样做,因为它主要只是使代码复杂化,其好处是无穷的. (因此,我没有提供任何示例代码.)

One further optimization is relatively simple, although I wouldn't recommend it because it mostly just complicates the code, and the benefit is infinitesimal. (For this very reason, I haven't included any sample code.)

在开始条件下,反斜杠(几乎)总是导致将单个字符追加到我们要累积的字符串中,这意味着我们可以调整字符串的大小以进行此追加,即使我们只是将其大小调整为追加非转义字符.相反,我们可以在操作中向字符串添加一个与非转义字符匹配的附加字符. (由于(f)lex将输入缓冲区修改为以NUL终止的令牌,因此令牌后的字符将始终为NUL,因此将追加长度增加1将在字符串中插入此NUL而不是反斜杠.但是那不重要.)

In the start condition, a backslash (almost) always results in appending a single character to the string we're accumulating, which means that we might resize the string in order to do this append, even though we just resized it to append the non-escaped characters. Instead, we could add one additional character to the string in the action which matches the non-escape characters. (Because (f)lex modifies the input buffer to NUL-terminate the token, the character following the token will always be a NUL, so increasing the length of the append by one will insert this NUL and not the backslash into the string. But that's not important.)

然后,用于处理转义字符的代码需要替换字符串中的最后一个字符,而不是将单个字符附加到字符串中,从而避免一个附加调用.当然,在我们不想插入任何内容的情况下,我们需要将字符串的大小减少一个字符,并且如果存在一个转义序列(例如unicode转义)会增加一个以上的字节字符串,我们需要做其他一些杂技.

Then the code which handles the escape character needs to replace the last character in the string rather than appending a single character to the string, thereby avoiding one append call. Of course, in the cases where we don't want to insert anything, we'll need to reduce the size of the string by one character, and if there is an escape sequence (such as unicode escapes) which add more than one byte to the string, we'll need to do some other acrobatics.

简而言之,我认为这是黑客,而不是优化.但是就其价值而言,我过去做过这样的事情,所以我也必须对过早的优化负责.

In short, I'd qualify this as a hack more than an optimization. But for what it's worth, I have done things like this in the past, so I have to plead guilty to the charge of premature optimization, too.

  1. 您的代码仅打印出令牌,这使得很难知道将字符串传递给解析器的设计.我在这里假设一种或多或少的标准策略,其中语义值yylval是一个联合,其成员是std::string*( not 一个std::string).我没有解决由此产生的内存管理问题,但是%destruct声明会有所帮助.

  1. Your code only prints out the token, which makes it hard to know what your design is for passing the string to the parser. I'm assuming here one more or less standard strategy in which the semantic value yylval is a union one of whose members is a std::string* (not a std::string). I don't address the resulting memory management issues, but a %destruct declaration will help a lot.

在此答案的原始版本中,我建议通过使用与反斜杠匹配的模式作为尾随上下文来捕获这种情况:

In the original version of this answer, I suggested catching this case by using a pattern which matches a backslash as trailing context:

\"[^"\\]*/\\    { yylval.str = new std::string(yytext + 1, yyleng - 1);
                  BEGIN(S_STRING);
                  /* No return, so the scanner will continue in the new state */
            }

但是使用最大咀嚼规则更简单,更通用.

But using the maximal munch rule is simpler and more general.

如果多个图案的最长匹配相同,则扫描仪描述中的第一个图案将获胜.

If more than one pattern has the same longest match, the first one in the scanner description wins.

这篇关于优化弹性字符串文字解析的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆