如何让一个普通的前pression字符串的AST? [英] How to get the AST of a regular expression string?

查看:109
本文介绍了如何让一个普通的前pression字符串的AST?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我怎样才能得到一个普通的前pression的抽象语法树(AST)(在C ++)?

How can I get the abstract syntax tree (AST) of a regular expression (in C++)?

例如,

 (XYZ)|(123)

应该产生的树:

        |
      /   \
    .       .
   / \     / \
  .   Z   .   3    
 / \     / \   
X  Y     1 2

有一个的boost ::精神语法解析定期EX pression模式?在的boost ::正则表达式图书馆应该有它,但我没有找到它。是否有任何其他开源工具可用,这将使我的正则表达式的抽象重新presentation?

Is there a boost::spirit grammar to parse regular expression patterns? The boost::regex library should have it, but I didn't find it. Are there any other open-source tools available that would give me the abstract representation of a regex?

推荐答案

我一头栽进这个问题又来。和我决定看看它实际上将是多么难写一个解析器定期EX pression语法与升压精神显著子集。

I stumbled into this question again. And I decided to take a look at how hard it would actually be to write a parser for a significant subset of regular expression syntax with Boost Spirit.

所以,像往常一样,我开始了笔和纸,并在一段时间后,脑子里想的一些规则草案。时间绘制类似的AST起来:

So, as usual, I started out with pen and paper, and after a while had some draft rules in mind. Time to draw the analogous AST up:

namespace ast
{
    struct multiplicity 
    {
        unsigned minoccurs;
        boost::optional<unsigned> maxoccurs;
        bool greedy;

        multiplicity(unsigned minoccurs = 1, boost::optional<unsigned> maxoccurs = 1) 
            : minoccurs(minoccurs), maxoccurs(maxoccurs), greedy(true)
        { }

        bool unbounded() const { return !maxoccurs; }
        bool repeating() const { return !maxoccurs || *maxoccurs > 1; }
    };

    struct charset
    {
        bool negated;

        using range   = boost::tuple<char, char>; // from, till
        using element = boost::variant<char, range>;

        std::set<element> elements; 
        // TODO: single set for loose elements, simplify() method
    };

    struct start_of_match {};
    struct end_of_match {};
    struct any_char {};
    struct group;

    typedef boost::variant<   // unquantified expression
        start_of_match,
        end_of_match,
        any_char,
        charset,
        std::string,          // literal
        boost::recursive_wrapper<group> // sub expression
    > simple;

    struct atom               // quantified simple expression
    {
        simple       expr;
        multiplicity mult;
    };

    using sequence    = std::vector<atom>;
    using alternative = std::vector<sequence>;
    using regex       = boost::variant<atom, sequence, alternative>;

    struct group {
        alternative root;

        group() = default;
        group(alternative root) : root(std::move(root)) { }
    };
}

这是你的,它通过变量可选的精神(由于与提升整合效果很好典型的AST(58 LOC),以及具有战略性的选择构造函数)。

This is your typical AST (58 LoC) that works well with Spirit (due to integrating with boost via variant and optional, as well as having strategically chosen constructors).

语法最后我们只稍长一点:

The grammar ended up being only slightly longer:

template <typename It>
    struct parser : qi::grammar<It, ast::alternative()>
{
    parser() : parser::base_type(alternative)
    {
        using namespace qi;
        using phx::construct;
        using ast::multiplicity;

        alternative = sequence % '|';
        sequence    = *atom;

        simple      = 
                      (group)
                    | (charset)
                    | ('.' >> qi::attr(ast::any_char()))
                    | ('^' >> qi::attr(ast::start_of_match()))
                    | ('$' >> qi::attr(ast::end_of_match()))
                    // optimize literal tree nodes by grouping unquantified literal chars
                    | (as_string [ +(literal >> !char_("{?+*")) ])
                    | (as_string [ literal ]) // lone char/escape + explicit_quantifier
                    ;

        atom        = (simple >> quantifier); // quantifier may be implicit

        explicit_quantifier  =
                    // bounded ranges:
                      lit('?')                                   [ _val = construct<multiplicity>( 0, 1)   ]
                    | ('{'  >> uint_ >> '}' )                    [ _val = construct<multiplicity>(_1, _1)  ]
                    // repeating ranges can be marked non-greedy:
                    | (                                        
                          lit('+')                               [ _val = construct<multiplicity>( 1, boost::none) ]
                        | lit('*')                               [ _val = construct<multiplicity>( 0, boost::none) ]
                        | ('{'  >> uint_ >> ",}")                [ _val = construct<multiplicity>(_1, boost::none) ]
                        | ('{'  >> uint_ >> "," >> uint_ >> '}') [ _val = construct<multiplicity>(_1, _2)  ]
                        | ("{," >> uint_ >> '}' )                [ _val = construct<multiplicity>( 0, _1)  ]
                      ) >> -lit('?')       [ phx::bind(&multiplicity::greedy, _val) = false ]
                    ;

        quantifier = explicit_quantifier | attr(ast::multiplicity());

        charset     = '[' 
                   >> (lit('^') >> attr(true) | attr(false)) // negated
                   >> *(range | charset_el)
                    > ']'
                    ;

        range       = charset_el >> '-' >> charset_el;

        group       = '(' >> alternative >> ')';

        literal     = unescape | ~char_("\\+*?.^$|{()") ;

        unescape    = ('\\' > char_);

        // helper to optionally unescape waiting for raw ']'
        charset_el  = !lit(']') >> (unescape|char_);
    }

  private:
    qi::rule<It, ast::alternative()>    alternative;
    qi::rule<It, ast::sequence()>       sequence;
    qi::rule<It, ast::atom()>           atom;
    qi::rule<It, ast::simple()>         simple;
    qi::rule<It, ast::multiplicity()>   explicit_quantifier, quantifier;
    qi::rule<It, ast::charset()>        charset;
    qi::rule<It, ast::charset::range()> range;
    qi::rule<It, ast::group()>          group;
    qi::rule<It, char()>                literal, unescape, charset_el;
};

现在,真正的乐趣是做与AST的东西。既然你要显示的树,我想从AST生成DOT图形。所以,我所做的:

Now, the real fun is to do something with the AST. Since you want to visualize the tree, I thought of generating DOT graph from the AST. So I did:

int main()
{
    std::cout << "digraph common {\n";

    for (std::string pattern: { 
            "abc?",
            "ab+c",
            "(ab)+c",
            "[^-a\\-f-z\"\\]aaaa-]?",
            "abc|d",
            "a?",
            ".*?(a|b){,9}?",
            "(XYZ)|(123)",
        })
    {
        std::cout << "// ================= " << pattern << " ========\n";
        ast::regex tree;
        if (doParse(pattern, tree))
        {
            check_roundtrip(tree, pattern);

            regex_todigraph printer(std::cout, pattern);
            boost::apply_visitor(printer, tree);
        }
    }

    std::cout << "}\n";
}

这个程序的结果如下图:

This program results in the following graphs:

自边缘描绘重复和颜色指示是否匹配是贪婪的(红色)或非贪婪(蓝色)。正如你可以看到我已经优化了AST有点为清楚起见,但(UN)征求意见的相关行会赚差价:

The self-edges depict repeats and the colour indicates whether the match is greedy (red) or non-greedy (blue). As you can see I've optimized the AST a bit for clarity, but (un)commenting the relevant lines will make the difference:

我想会不会太难调。希望这将作为灵感的人。

I think it wouldn't be too hard to tune. Hopefully it will serve as inspiration to someone.

全部code这个要点是: https://gist.github.com/sehe/8678988

Full code at this gist: https://gist.github.com/sehe/8678988

这篇关于如何让一个普通的前pression字符串的AST?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆