Using C++11 regex to capture the contents of a context-free-grammar file


Question

I'm trying to write my own context-free-grammar specification, to associate with the rules of my lexer/parser. It is meant to be similar to ANTLR's, where upper-case identifiers classify as a Lexer rule and lower-case identifiers classify as a Parser rule. It is meant to accept any combination of string literals and/or regular expressions for lexer rules, and any combination of lexer/regex rules and/or other parser identifiers for parser rules. Each rule is in the format <identifier>:<expression>;

Here is an example grammar:

integer      : DIGIT+;        //parser rule containing at least one lexer rule
twodigits    : DIGIT DIGIT;   //parser rule containing two consecutive lexer rules
DIGIT        : [0-9];         //lexer rule containing regex
string       : '"' CHAR* '"'; //parser rule containing zero or more 
                              //  lexer rules, wrapped in two string literals
CHAR         : (LCHAR|UCHAR); //lexer rule containing two lexer rules which
                              //  will later evaluate to one of two tokens
LCHAR        : [a-z];         //lexer rule containing regex
UCHAR        : [A-Z];         //lexer rule containing regex
SPACE        : ' ';           //lexer rule containing string literal


The trouble I'm having is matching the expression strings, since their contents tend to vary.
I originally wrote:
([a-zA-Z0-9_]*)(?:\s*)(?:\:)(?:\s*)((?:\'?).*(?:\'?)(?:\;))
as the match rule, which does okay for a single string literal expression surrounded by single quotes, but I need to expand this to allow for multiple non-greedy string literals, and combined statements separated by any amount of whitespace. I am not concerned with matching potential regexes within a matched expression, or even capturing segregated parts of the expression, as this is handled later on by a separate regex operation, so really I just need to validate identifiers and expressions...

All in all, I need the regex_search operation to look through the grammar's contents, using the following syntax for matches:

  • A valid identifier, starting with one or more lower or uppercase letters, optionally followed by any number of alphanumeric characters (which can optionally contain any number of underscore characters in between, as long as the identifier does not start or end with one).
  • Any number of whitespace characters, tabs, newlines etc, without capturing it.
  • A colon without capturing it.
  • Any number of whitespace characters, tabs, newlines etc, without capturing it.
  • At least one of: (in any order) any number of string literals (enclosed in single quotes, without capturing the quotes), any number of lexer/parser identifiers, any number of regexes (enclosed in square brackets). The result of this match rule should capture the entire expression as a single string, which will later go through a post-processing stage.
  • Any number of whitespace characters, tabs, newlines etc, without capturing it.
  • A semicolon optionally followed by any uncaptured whitespace.
  • Optionally, any number of uncaptured spaces followed by a single captured line comment.
  • Any number of whitespace characters, tabs, newlines etc, without capturing it.

Is it possible to place this into a single regex_search operation?
I've messed around in Expresso and just can't seem to get it right...

So far, I've been able to come up with the following:

#/////////////////////
# Identifier
#/////////////////////
(
    (?:[a-zA-Z]+)           # At least one lower/uppercase letter
    (?:
        (?:[a-zA-Z0-9_]*)   # Zero or more alphanumeric/underscore characters,
        (?:\w+)             # explicitly followed by one or more alphanumeric
    )?                      #   characters
)

#/////////////////////
# Separator
#/////////////////////
(?:\s*)                     # Any amount of uncaptured whitespace
(?:\:)                      # An uncaptured colon
(?:\s*)                     # Any amount of uncaptured whitespace

#///////////////////////
# Expression
#///////////////////////
(
    # String Literals:
    (?:\'?)                 # An optional single quote,
    (?:                     #   which is meant to start and end a string
        (?:[^'\\] | \\.)*   #   literal, but issues several problems for
    )                       #   me (see comments below, after this code block)
    (?:\'?)
    # Other expressions
    # ????????????
)

#/////////////////////
# Line End
#/////////////////////
(?:\s*)                     # Any amount of uncaptured whitespace
(?:\;)                      # An uncaptured semicolon
(?:\s*)                     # Any amount of uncaptured whitespace

As you can see, I have identifiers, separators and line-ends working perfectly. But expressions are where I'm totally stuck!

How can I tell the regex library that I want EITHER a non-greedy string literal, OR any set of characters before the Line End, AND any number of them in any order?

Even if I only allowed a single string literal, how would I say "The closing single quote is NOT optional if the first one exists"?

Recommended answer

It might not be flawless, and may require additional coding in how match results are handled, but this appears to work:

#/////////////////////
# Identifier
#/////////////////////
(
    (?:[a-zA-Z]+)
    (?:
        (?:[a-zA-Z0-9_]*)
        (?:\w+)
    )?
)

#/////////////////////
# Separator
#/////////////////////
(?:\s*\:\s*)

#///////////////////////
# Expression
#///////////////////////
(
    '(?:\\.|[^'])*'|[^']+       # Quoted literal (closing quote required) or any run of non-quote characters; might need to be processed separately
)

#/////////////////////
# Line End
#/////////////////////
(?:\s*\;\s*)
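
For anyone trying this with std::regex directly: the default ECMAScript grammar has no free-spacing mode, so the whitespace and # comments in the layout above have to be stripped before the pattern is compiled. Below is a minimal sketch of one way to do that, assuming adjacent raw string literals are used to keep the pieces readable. It is a tightened variation, not the exact pattern above: the expression is built from terms (a complete quoted literal, a bracketed class, or any other character except a quote, an opening bracket or a semicolon), so a closing quote is mandatory once a literal has been opened and a match cannot run past the semicolon into the next rule. The names Rule, rule_regex and parse_rules are hypothetical and exist only for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// One grammar rule as captured by the regex (hypothetical helper type).
struct Rule {
    std::string identifier;
    std::string expression;  // raw text, to be post-processed separately
    std::string comment;     // empty when there is no trailing // comment
};

// Adjacent raw string literals are concatenated by the compiler, which
// gives a commented layout without needing a free-spacing regex mode.
const std::regex& rule_regex() {
    static const std::regex re(
        // 1: identifier, starts with a letter, never ends with '_'
        R"(([a-zA-Z](?:[a-zA-Z0-9_]*[a-zA-Z0-9])?))"
        // separator
        R"(\s*:\s*)"
        // 2: expression, one or more terms (lazy, so trailing
        //    whitespace before the semicolon is not captured):
        //    'literal' | [class] | any char except ' [ ;
        R"(((?:'(?:\\.|[^'\\])*'|\[(?:\\.|[^\]\\])*\]|[^'\[;])+?))"
        // line end
        R"(\s*;)"
        // 3: optional line comment on the same line
        R"((?:[ \t]*(//[^\n]*))?)");
    return re;
}

// Repeatedly applies regex_search, resuming after each match.
std::vector<Rule> parse_rules(const std::string& grammar) {
    std::vector<Rule> rules;
    std::smatch m;
    auto begin = grammar.cbegin();
    const auto end = grammar.cend();
    while (std::regex_search(begin, end, m, rule_regex())) {
        // str() yields an empty string for an unmatched group.
        rules.push_back({m[1].str(), m[2].str(), m[3].str()});
        begin = m[0].second;
    }
    return rules;
}
```

Calling parse_rules on the example grammar from the question should yield one Rule per line that contains a colon, with continuation comment lines skipped. Two caveats with this sketch: regex_search skips over anything it cannot match, so a malformed rule (for example an unterminated string literal) is silently dropped rather than reported, and a comment that itself looks like "word: something;" could be picked up as a rule.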
