How to make a Boost.Spirit.Lex token value be a substring of the matched sequence (preferably by regex matching group)


Problem description

I'm writing a simple expression parser. It is built on a Boost.Spirit.Qi grammar based on Boost.Spirit.Lex tokens (Boost version 1.56).

The tokens are defined as follows:

using namespace boost::spirit;

template<
    typename lexer_t
>
struct tokens
    : lex::lexer<lexer_t>
{
    tokens()
        : /* ... */,
          variable("%(\\w+)")
    {
        this->self =
            /* ... */ |
            variable;
    }

    /* ... */
    lex::token_def<std::string> variable;
};

Now I would like the value of the variable token to be just the name (the matching group (\\w+)), without the % prefix. How do I do that?


Using a matching group by itself doesn't help. The value is still the full string, including the % prefix.

Is there any way to force the use of a matching group?

Or at least to refer to it somehow within the token's action?


I also tried using an action like this:

variable[lex::_val = std::string(lex::_start + 1, lex::_end)]

but it failed to compile. The error said that none of the std::string constructor overloads could match the arguments:

(const boost::phoenix::actor<Expr>, const boost::spirit::lex::_end_type)


The even simpler

variable[lex::_val = std::string(lex::_start, lex::_end)]

also failed to compile, for a similar reason; only the first argument type was now boost::spirit::lex::_start_type.


Finally, I tried this (even though it looks wasteful):

lex::_val = std::string(lex::_val).erase(0, 1)

but that also failed to compile. This time the compiler was unable to convert from const boost::spirit::lex::_val_type to std::string.


Is there any way to deal with this problem?

Solution

Simple Solution

The correct way to construct the std::string attribute value is the following:

variable[lex::_val = boost::phoenix::construct<std::string>(lex::_start + 1, lex::_end)]

exactly as suggested by jv_ in his (or her) comment.

boost::phoenix::construct is provided by the <boost/phoenix/object/construct.hpp> header. Alternatively, include <boost/phoenix.hpp>.
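The reason construct is needed is that lex::_start and lex::_end are lazy placeholders: they have no iterator values until a token is actually matched, whereas a plain std::string(...) constructor call runs immediately, while the action expression is still being built. The sketch below imitates that deferral with a standard-library lambda. It is only an analogy, not Phoenix itself, and the names deferred_value, make_skip_prefix and run_on_match are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for a deferred action: it receives the matched range
// later, when a token has actually been recognised.
typedef std::function<std::string(std::string::const_iterator,
                                  std::string::const_iterator)> deferred_value;

// Builds the deferred computation now; no string is constructed until it is
// called. `skip` is the number of leading prefix characters to drop.
inline deferred_value make_skip_prefix(std::size_t skip)
{
    return [skip](std::string::const_iterator first,
                  std::string::const_iterator last) {
        assert(static_cast<std::size_t>(last - first) >= skip);
        return std::string(first + static_cast<std::ptrdiff_t>(skip), last);
    };
}

// Simulates the lexer invoking the action on the text of a matched token.
inline std::string run_on_match(deferred_value const& action,
                                std::string const& matched)
{
    return action(matched.begin(), matched.end());
}
```

Calling run_on_match(make_skip_prefix(1), "%name") plays the role of the lexer evaluating construct<std::string>(lex::_start + 1, lex::_end) for the token text %name.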

Regular Expression Solution

The above solution, however, works well only in simple cases. It also rules out having the pattern provided from outside (from configuration data in particular), since changing the pattern to, for example, %(\\w+)% would require changing the value construction code.

That is why it would be much better to be able to refer to capture groups from the regular expression defining the token.

Note that this still isn't perfect, since odd cases like %(\\w+)%(\\w+)% would still require a code change to be handled correctly. That could be worked around by configuring not only the regex for the token but also the means of forming the value from the matched range. This, however, goes beyond the scope of the question. Using capture groups directly seems flexible enough for many cases.
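Purely as an illustration of that last idea, and outside Spirit altogether, the value could be formed from a configurable format string applied to the match. The sketch below uses std::regex from the standard library instead of Boost.Regex; format_capture and its parameters are hypothetical names, and the "$1.$2" format is just an example of what configuration data might contain.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: re-matches `matched` against `pattern` and builds the
// value from `value_format`, where $1, $2, ... refer to capture groups.
inline std::string format_capture(std::string const& pattern,
                                  std::string const& value_format,
                                  std::string const& matched)
{
    std::regex const re(pattern);
    std::smatch results;
    bool const ok = std::regex_match(matched, results, re);
    assert(ok); // the lexer has already matched this text
    if (!ok)
        return std::string();
    return results.format(value_format);
}
```

With this, both the pattern and the way the value is assembled from its captures could come from configuration, for example format_capture("%(\\w+)%(\\w+)%", "$1.$2", "%ns%name%").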

sehe stated in a comment elsewhere that there is no way to use capture groups from a token's regular expression. Moreover, tokens actually support only a subset of regular expression syntax. (Notable differences include, for example, the lack of support for naming capture groups or ignoring them.)

My own experiments in this area support that as well. Sadly, there is no way to use capture groups directly. There is a workaround, however: you simply re-apply the regex in your action.
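The core of the workaround can be sketched without Spirit: take the text the lexer has already matched, run the same pattern over it again, and read the capture group from the result. This minimal version uses std::regex rather than Boost.Regex to stay self-contained; capture_of is a hypothetical helper, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: re-applies `pattern` to the already matched token text
// and returns the requested capture group as a string.
inline std::string capture_of(std::string const& pattern,
                              std::string const& matched,
                              std::size_t capture_index = 1)
{
    std::regex const re(pattern);
    std::smatch results;
    bool const ok = std::regex_match(matched, results, re);
    assert(ok && capture_index < results.size());
    if (!ok || capture_index >= results.size())
        return std::string();
    return results[capture_index].str();
}
```

Because the value comes from the capture group rather than from a hard-coded offset, changing the pattern from %(\\w+) to %(\\w+)% needs no change in the extraction code.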

Action Obtaining Capture Range

To keep things modular, let's start with the simplest task: an action that returns, as a boost::iterator_range, the part of the token's match corresponding to the specified capture.

template<typename Attribute, typename Char, typename Idtype>
class basic_get_capture
{
public:
    typedef lex::token_def<Attribute, Char, Idtype> token_type;
    typedef boost::basic_regex<Char> regex_type;

    explicit basic_get_capture(token_type const& token, int capture_index = 1)
        : token(token),
          regex(),
          capture_index(capture_index)
    {
    }

    template<typename Iterator, typename IdType, typename Context>
    boost::iterator_range<Iterator> operator ()(Iterator& first, Iterator& last, lex::pass_flags& /*flag*/, IdType& /*id*/, Context& /*context*/)
    {
        typedef boost::match_results<Iterator> match_results_type;

        match_results_type results;
        regex_match(first, last, results, get_regex());
        typename match_results_type::const_reference capture = results[capture_index];
        return boost::iterator_range<Iterator>(capture.first, capture.second);
    }

private:
    regex_type& get_regex()
    {
        if(regex.empty())
        {
            // string_type is a dependent type here, so `typename` is required.
            typename token_type::string_type const& regex_text = token.definition();
            regex.assign(regex_text);
        }
        return regex;
    }

    token_type const& token;
    regex_type regex;
    int capture_index;
};

template<typename Attribute, typename Char, typename Idtype>
basic_get_capture<Attribute, Char, Idtype> get_capture(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
{
    return basic_get_capture<Attribute, Char, Idtype>(token, capture_index);
}

The action uses Boost.Regex (include <boost/regex.hpp>).
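Two properties of this action are worth seeing in isolation: the regex is compiled only once, on first use, and the result is a pair of iterators into the original input rather than a copy. The following is a rough standard-library analogue, assuming std::regex in place of Boost.Regex and a plain std::pair in place of boost::iterator_range; the class name capture_range_finder is hypothetical. Unlike the action above, it returns an empty range if the re-match fails.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Hypothetical std::regex analogue of basic_get_capture.
class capture_range_finder
{
public:
    typedef std::string::const_iterator iterator;
    typedef std::pair<iterator, iterator> range;

    explicit capture_range_finder(std::string pattern, std::size_t capture_index = 1)
        : pattern_(std::move(pattern)), compiled_(false), capture_index_(capture_index)
    {
    }

    // Returns iterators into [first, last); no characters are copied.
    range operator()(iterator first, iterator last)
    {
        if (!compiled_) // compile lazily, then reuse on later calls
        {
            regex_.assign(pattern_);
            compiled_ = true;
        }
        std::match_results<iterator> results;
        if (!std::regex_match(first, last, results, regex_))
            return range(last, last); // empty range when the text does not match
        assert(capture_index_ < results.size());
        return range(results[capture_index_].first, results[capture_index_].second);
    }

private:
    std::string pattern_;
    std::regex regex_;
    bool compiled_;
    std::size_t capture_index_;
};
```

The returned iterators stay valid only as long as the input buffer does, which is the same constraint the iterator_range version has.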

Action Obtaining Capture as String

The capture range is nice to have because it doesn't allocate any new memory for a string, but it is the string that we want in the end. So here is another action, built upon the previous one.

template<typename Attribute, typename Char, typename Idtype>
class basic_get_capture_as_string
{
public:
    typedef basic_get_capture<Attribute, Char, Idtype> basic_get_capture_type;
    typedef typename basic_get_capture_type::token_type token_type;

    explicit basic_get_capture_as_string(token_type const& token, int capture_index = 1)
        : get_capture_functor(token, capture_index)
    {
    }

    template<typename Iterator, typename IdType, typename Context>
    std::basic_string<Char> operator ()(Iterator& first, Iterator& last, lex::pass_flags& flag, IdType& id, Context& context)
    {
        boost::iterator_range<Iterator> const& capture = get_capture_functor(first, last, flag, id, context);
        return std::basic_string<Char>(capture.begin(), capture.end());
    }

private:
    basic_get_capture_type get_capture_functor;
};

template<typename Attribute, typename Char, typename Idtype>
basic_get_capture_as_string<Attribute, Char, Idtype> get_capture_as_string(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
{
    return basic_get_capture_as_string<Attribute, Char, Idtype>(token, capture_index);
}

No magic here. We just make an std::basic_string from the range returned by the simpler action.

Action Assigning Value From the Capture

Actions that merely return a value are of little use to us. The ultimate goal is to set the token value from the capture, and this is done by the last action.

template<typename Attribute, typename Char, typename Idtype>
class basic_set_val_from_capture
{
public:
    typedef basic_get_capture_as_string<Attribute, Char, Idtype> basic_get_capture_as_string_type;
    typedef typename basic_get_capture_as_string_type::token_type token_type;

    explicit basic_set_val_from_capture(token_type const& token, int capture_index = 1)
        : get_capture_as_string_functor(token, capture_index)
    {
    }

    template<typename Iterator, typename IdType, typename Context>
    void operator ()(Iterator& first, Iterator& last, lex::pass_flags& flag, IdType& id, Context& context)
    {
        std::basic_string<Char> const& capture = get_capture_as_string_functor(first, last, flag, id, context);
        context.set_value(capture);
    }

private:
    basic_get_capture_as_string_type get_capture_as_string_functor;
};

template<typename Attribute, typename Char, typename Idtype>
basic_set_val_from_capture<Attribute, Char, Idtype> set_val_from_capture(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
{
    return basic_set_val_from_capture<Attribute, Char, Idtype>(token, capture_index);
}

Discussion

The actions are used like this:

variable[set_val_from_capture(variable)]

Optionally, you can provide a second argument: the index of the capture to use. It defaults to 1, which seems suitable in most cases.

Creating Functions

set_val_from_capture (and likewise get_capture_as_string and get_capture) is an auxiliary function used for automatic deduction of template arguments from the token_def. In particular, what we need is the Char type, in order to create the corresponding regular expression.

I'm not sure whether this could reasonably be avoided, and even if it could, it would significantly complicate the call operator (especially if we strive to cache the regex object instead of building it anew each time). My doubts come mostly from not being sure whether the Char type of token_def is required to be the same as the character type of the tokenized sequence. I assumed that they don't have to be the same.

Repeating the Token

A definitely unpleasant part of the action is the need to pass the token itself as an argument, which is a repetition.

The token is, however, needed for the Char type as described above and to... get the regular expression!

It seems to me that, at least in theory, we could obtain the token somehow "at run-time" based on the id argument to the action (which we currently just ignore). However, I failed to find any way to obtain a token_def from the token's identifier, whether from the context argument or from the lexer itself (which could be passed to the action as this through the creating function).

Reusability

Since these are actions, they are not really reusable (out of the box) in more complex scenarios. For example, if you wanted not only to get the capture but also to convert it to some numeric value, you would have to write yet another action in the same way, instead of composing a more complex action at the token.
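To make the numeric case concrete, here is what such an additional step would have to do, shown outside Spirit with std::regex and std::stoi. This is only a sketch of the extract-then-convert logic, not a Spirit action; capture_as_int and the #(\\d+) pattern in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: extracts a capture group from the matched token text
// and converts it to int.
inline int capture_as_int(std::string const& pattern,
                          std::string const& matched,
                          std::size_t capture_index = 1)
{
    std::regex const re(pattern);
    std::smatch results;
    if (!std::regex_match(matched, results, re))
        throw std::invalid_argument("token text does not match its pattern");
    assert(capture_index < results.size());
    // std::stoi throws std::invalid_argument or std::out_of_range on bad input.
    return std::stoi(results[capture_index].str());
}
```

For a token defined as #(\\d+), capture_as_int("#(\\d+)", "#42") yields the integer value of the digits after the prefix.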

At first I tried to achieve something like this:

variable[lex::_val = get_capture_as_string(variable)]

That seems more flexible, as you could easily add more code around it, for example by wrapping it in some conversion function.

But I failed to achieve it, although I feel I didn't try hard enough. Learning more about Boost.Phoenix would surely help a lot here.

Double Work

This whole workaround doesn't save us from doing double work, both in parsing the regex and then in matching it. But as mentioned at the beginning, there seems to be no better way (without altering Boost.Spirit itself).
