如何让一个普通的前pression字符串的AST？ [英] How to get the AST of a regular expression string?

查看：109 发布时间：2016/8/12 17:29:17 c++ regex boost

本文介绍了如何让一个普通的前pression字符串的AST？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我怎样才能得到一个普通的前pression的抽象语法树（AST）（在C ++）？

How can I get the abstract syntax tree (AST) of a regular expression (in C++)?

例如，

 (XYZ)|(123)

应该产生的树：

        |
      /   \
    .       .
   / \     / \
  .   Z   .   3    
 / \     / \   
X  Y     1 2

有一个的boost ::精神语法解析定期EX pression模式？在的boost ::正则表达式图书馆应该有它，但我没有找到它。是否有任何其他开源工具可用，这将使我的正则表达式的抽象重新presentation？

Is there a boost::spirit grammar to parse regular expression patterns? The boost::regex library should have it, but I didn't find it. Are there any other open-source tools available that would give me the abstract representation of a regex?

推荐答案

我一头栽进这个问题又来。和我决定看看它实际上将是多么难写一个解析器定期EX pression语法与升压精神显著子集。

I stumbled into this question again. And I decided to take a look at how hard it would actually be to write a parser for a significant subset of regular expression syntax with Boost Spirit.

所以，像往常一样，我开始了笔和纸，并在一段时间后，脑子里想的一些规则草案。时间绘制类似的AST起来：

So, as usual, I started out with pen and paper, and after a while had some draft rules in mind. Time to draw the analogous AST up:

namespace ast
{
    struct multiplicity 
    {
        unsigned minoccurs;
        boost::optional<unsigned> maxoccurs;
        bool greedy;

        multiplicity(unsigned minoccurs = 1, boost::optional<unsigned> maxoccurs = 1) 
            : minoccurs(minoccurs), maxoccurs(maxoccurs), greedy(true)
        { }

        bool unbounded() const { return !maxoccurs; }
        bool repeating() const { return !maxoccurs || *maxoccurs > 1; }
    };

    struct charset
    {
        bool negated;

        using range   = boost::tuple<char, char>; // from, till
        using element = boost::variant<char, range>;

        std::set<element> elements; 
        // TODO: single set for loose elements, simplify() method
    };

    struct start_of_match {};
    struct end_of_match {};
    struct any_char {};
    struct group;

    typedef boost::variant<   // unquantified expression
        start_of_match,
        end_of_match,
        any_char,
        charset,
        std::string,          // literal
        boost::recursive_wrapper<group> // sub expression
    > simple;

    struct atom               // quantified simple expression
    {
        simple       expr;
        multiplicity mult;
    };

    using sequence    = std::vector<atom>;
    using alternative = std::vector<sequence>;
    using regex       = boost::variant<atom, sequence, alternative>;

    struct group {
        alternative root;

        group() = default;
        group(alternative root) : root(std::move(root)) { }
    };
}

这是你的，它通过变量和可选的精神（由于与提升整合效果很好典型的AST（58 LOC），以及具有战略性的选择构造函数）。

This is your typical AST (58 LoC) that works well with Spirit (due to integrating with boost via variant and optional, as well as having strategically chosen constructors).

语法最后我们只稍长一点：

The grammar ended up being only slightly longer:

template <typename It>
    struct parser : qi::grammar<It, ast::alternative()>
{
    parser() : parser::base_type(alternative)
    {
        using namespace qi;
        using phx::construct;
        using ast::multiplicity;

        alternative = sequence % '|';
        sequence    = *atom;

        simple      = 
                      (group)
                    | (charset)
                    | ('.' >> qi::attr(ast::any_char()))
                    | ('^' >> qi::attr(ast::start_of_match()))
                    | ('$' >> qi::attr(ast::end_of_match()))
                    // optimize literal tree nodes by grouping unquantified literal chars
                    | (as_string [ +(literal >> !char_("{?+*")) ])
                    | (as_string [ literal ]) // lone char/escape + explicit_quantifier
                    ;

        atom        = (simple >> quantifier); // quantifier may be implicit

        explicit_quantifier  =
                    // bounded ranges:
                      lit('?')                                   [ _val = construct<multiplicity>( 0, 1)   ]
                    | ('{'  >> uint_ >> '}' )                    [ _val = construct<multiplicity>(_1, _1)  ]
                    // repeating ranges can be marked non-greedy:
                    | (                                        
                          lit('+')                               [ _val = construct<multiplicity>( 1, boost::none) ]
                        | lit('*')                               [ _val = construct<multiplicity>( 0, boost::none) ]
                        | ('{'  >> uint_ >> ",}")                [ _val = construct<multiplicity>(_1, boost::none) ]
                        | ('{'  >> uint_ >> "," >> uint_ >> '}') [ _val = construct<multiplicity>(_1, _2)  ]
                        | ("{," >> uint_ >> '}' )                [ _val = construct<multiplicity>( 0, _1)  ]
                      ) >> -lit('?')       [ phx::bind(&multiplicity::greedy, _val) = false ]
                    ;

        quantifier = explicit_quantifier | attr(ast::multiplicity());

        charset     = '[' 
                   >> (lit('^') >> attr(true) | attr(false)) // negated
                   >> *(range | charset_el)
                    > ']'
                    ;

        range       = charset_el >> '-' >> charset_el;

        group       = '(' >> alternative >> ')';

        literal     = unescape | ~char_("\\+*?.^$|{()") ;

        unescape    = ('\\' > char_);

        // helper to optionally unescape waiting for raw ']'
        charset_el  = !lit(']') >> (unescape|char_);
    }

  private:
    qi::rule<It, ast::alternative()>    alternative;
    qi::rule<It, ast::sequence()>       sequence;
    qi::rule<It, ast::atom()>           atom;
    qi::rule<It, ast::simple()>         simple;
    qi::rule<It, ast::multiplicity()>   explicit_quantifier, quantifier;
    qi::rule<It, ast::charset()>        charset;
    qi::rule<It, ast::charset::range()> range;
    qi::rule<It, ast::group()>          group;
    qi::rule<It, char()>                literal, unescape, charset_el;
};

现在，真正的乐趣是做与AST的东西。既然你要显示的树，我想从AST生成DOT图形。所以，我所做的：

Now, the real fun is to do something with the AST. Since you want to visualize the tree, I thought of generating DOT graph from the AST. So I did:

int main()
{
    std::cout << "digraph common {\n";

    for (std::string pattern: { 
            "abc?",
            "ab+c",
            "(ab)+c",
            "[^-a\\-f-z\"\\]aaaa-]?",
            "abc|d",
            "a?",
            ".*?(a|b){,9}?",
            "(XYZ)|(123)",
        })
    {
        std::cout << "// ================= " << pattern << " ========\n";
        ast::regex tree;
        if (doParse(pattern, tree))
        {
            check_roundtrip(tree, pattern);

            regex_todigraph printer(std::cout, pattern);
            boost::apply_visitor(printer, tree);
        }
    }

    std::cout << "}\n";
}

这个程序的结果如下图：

This program results in the following graphs:

自边缘描绘重复和颜色指示是否匹配是贪婪的（红色）或非贪婪（蓝色）。正如你可以看到我已经优化了AST有点为清楚起见，但（UN）征求意见的相关行会赚差价：

The self-edges depict repeats and the colour indicates whether the match is greedy (red) or non-greedy (blue). As you can see I've optimized the AST a bit for clarity, but (un)commenting the relevant lines will make the difference:

我想会不会太难调。希望这将作为灵感的人。

I think it wouldn't be too hard to tune. Hopefully it will serve as inspiration to someone.

全部code这个要点是： https://gist.github.com/sehe/8678988

Full code at this gist: https://gist.github.com/sehe/8678988

这篇关于如何让一个普通的前pression字符串的AST？的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何让一个普通的前pression字符串的AST？ [英] How to get the AST of a regular expression string?

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

如何让一个普通的前pression字符串的AST？ [英] How to get the AST of a regular expression string?

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭