如何让一个普通的前pression字符串的AST? [英] How to get the AST of a regular expression string?
问题描述
我怎样才能得到一个普通的前pression的抽象语法树(AST)(在C ++)?
How can I get the abstract syntax tree (AST) of a regular expression (in C++)?
例如,
(XYZ)|(123)
应该产生的树:
|
/ \
. .
/ \ / \
. Z . 3
/ \ / \
X Y 1 2
有一个的boost ::精神
语法解析定期EX pression模式?在的boost ::正则表达式
图书馆应该有它,但我没有找到它。是否有任何其他开源工具可用,这将使我的正则表达式的抽象重新presentation?
Is there a boost::spirit
grammar to parse regular expression patterns? The boost::regex
library should have it, but I didn't find it. Are there any other open-source tools available that would give me the abstract representation of a regex?
推荐答案
我一头栽进这个问题又来。和我决定看看它实际上将是多么难写一个解析器定期EX pression语法与升压精神显著子集。
I stumbled into this question again. And I decided to take a look at how hard it would actually be to write a parser for a significant subset of regular expression syntax with Boost Spirit.
所以,像往常一样,我开始了笔和纸,并在一段时间后,脑子里想的一些规则草案。时间绘制类似的AST起来:
So, as usual, I started out with pen and paper, and after a while had some draft rules in mind. Time to draw the analogous AST up:
namespace ast
{
struct multiplicity
{
unsigned minoccurs;
boost::optional<unsigned> maxoccurs;
bool greedy;
multiplicity(unsigned minoccurs = 1, boost::optional<unsigned> maxoccurs = 1)
: minoccurs(minoccurs), maxoccurs(maxoccurs), greedy(true)
{ }
bool unbounded() const { return !maxoccurs; }
bool repeating() const { return !maxoccurs || *maxoccurs > 1; }
};
struct charset
{
bool negated;
using range = boost::tuple<char, char>; // from, till
using element = boost::variant<char, range>;
std::set<element> elements;
// TODO: single set for loose elements, simplify() method
};
struct start_of_match {};
struct end_of_match {};
struct any_char {};
struct group;
typedef boost::variant< // unquantified expression
start_of_match,
end_of_match,
any_char,
charset,
std::string, // literal
boost::recursive_wrapper<group> // sub expression
> simple;
struct atom // quantified simple expression
{
simple expr;
multiplicity mult;
};
using sequence = std::vector<atom>;
using alternative = std::vector<sequence>;
using regex = boost::variant<atom, sequence, alternative>;
struct group {
alternative root;
group() = default;
group(alternative root) : root(std::move(root)) { }
};
}
这是你的,它通过变量
和可选的精神(由于与提升整合效果很好典型的AST(58 LOC)
,以及具有战略性的选择构造函数)。
This is your typical AST (58 LoC) that works well with Spirit (due to integrating with boost via variant
and optional
, as well as having strategically chosen constructors).
语法最后我们只稍长一点:
The grammar ended up being only slightly longer:
template <typename It>
struct parser : qi::grammar<It, ast::alternative()>
{
parser() : parser::base_type(alternative)
{
using namespace qi;
using phx::construct;
using ast::multiplicity;
alternative = sequence % '|';
sequence = *atom;
simple =
(group)
| (charset)
| ('.' >> qi::attr(ast::any_char()))
| ('^' >> qi::attr(ast::start_of_match()))
| ('$' >> qi::attr(ast::end_of_match()))
// optimize literal tree nodes by grouping unquantified literal chars
| (as_string [ +(literal >> !char_("{?+*")) ])
| (as_string [ literal ]) // lone char/escape + explicit_quantifier
;
atom = (simple >> quantifier); // quantifier may be implicit
explicit_quantifier =
// bounded ranges:
lit('?') [ _val = construct<multiplicity>( 0, 1) ]
| ('{' >> uint_ >> '}' ) [ _val = construct<multiplicity>(_1, _1) ]
// repeating ranges can be marked non-greedy:
| (
lit('+') [ _val = construct<multiplicity>( 1, boost::none) ]
| lit('*') [ _val = construct<multiplicity>( 0, boost::none) ]
| ('{' >> uint_ >> ",}") [ _val = construct<multiplicity>(_1, boost::none) ]
| ('{' >> uint_ >> "," >> uint_ >> '}') [ _val = construct<multiplicity>(_1, _2) ]
| ("{," >> uint_ >> '}' ) [ _val = construct<multiplicity>( 0, _1) ]
) >> -lit('?') [ phx::bind(&multiplicity::greedy, _val) = false ]
;
quantifier = explicit_quantifier | attr(ast::multiplicity());
charset = '['
>> (lit('^') >> attr(true) | attr(false)) // negated
>> *(range | charset_el)
> ']'
;
range = charset_el >> '-' >> charset_el;
group = '(' >> alternative >> ')';
literal = unescape | ~char_("\\+*?.^$|{()") ;
unescape = ('\\' > char_);
// helper to optionally unescape waiting for raw ']'
charset_el = !lit(']') >> (unescape|char_);
}
private:
qi::rule<It, ast::alternative()> alternative;
qi::rule<It, ast::sequence()> sequence;
qi::rule<It, ast::atom()> atom;
qi::rule<It, ast::simple()> simple;
qi::rule<It, ast::multiplicity()> explicit_quantifier, quantifier;
qi::rule<It, ast::charset()> charset;
qi::rule<It, ast::charset::range()> range;
qi::rule<It, ast::group()> group;
qi::rule<It, char()> literal, unescape, charset_el;
};
现在,真正的乐趣是做与AST的东西。既然你要显示的树,我想从AST生成DOT图形。所以,我所做的:
Now, the real fun is to do something with the AST. Since you want to visualize the tree, I thought of generating DOT graph from the AST. So I did:
int main()
{
std::cout << "digraph common {\n";
for (std::string pattern: {
"abc?",
"ab+c",
"(ab)+c",
"[^-a\\-f-z\"\\]aaaa-]?",
"abc|d",
"a?",
".*?(a|b){,9}?",
"(XYZ)|(123)",
})
{
std::cout << "// ================= " << pattern << " ========\n";
ast::regex tree;
if (doParse(pattern, tree))
{
check_roundtrip(tree, pattern);
regex_todigraph printer(std::cout, pattern);
boost::apply_visitor(printer, tree);
}
}
std::cout << "}\n";
}
这个程序的结果如下图:
This program results in the following graphs:
自边缘描绘重复和颜色指示是否匹配是贪婪的(红色)或非贪婪(蓝色)。正如你可以看到我已经优化了AST有点为清楚起见,但(UN)征求意见的相关行会赚差价:
The self-edges depict repeats and the colour indicates whether the match is greedy (red) or non-greedy (blue). As you can see I've optimized the AST a bit for clarity, but (un)commenting the relevant lines will make the difference:
我想会不会太难调。希望这将作为灵感的人。
I think it wouldn't be too hard to tune. Hopefully it will serve as inspiration to someone.
全部code这个要点是: https://gist.github.com/sehe/8678988
Full code at this gist: https://gist.github.com/sehe/8678988
这篇关于如何让一个普通的前pression字符串的AST?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!