How to get rid of escape character in a token with spirit::lex?
Question
I want to tokenize my own extension of SQL syntax. This involves recognizing an escaped double quote inside a double-quoted string. E.g. in MySQL these two string tokens are equivalent: """" (the second double quote acts as an escape character) and '"'. I have tried different things, but I am stuck at how to replace a token's value.
#include <boost/spirit/include/lex_lexertl.hpp>
namespace lex = boost::spirit::lex;

template <typename Lexer>
struct sql_tokens : lex::lexer<Lexer>
{
    sql_tokens()
    {
        string_quote_double = "\\\""; // '"'
        this->self("INITIAL")
            = string_quote_double [ lex::_state = "STRING_DOUBLE" ] // how to also ignore + ctx.more()?
            | ...
            ;
        this->self("STRING_DOUBLE")
            = lex::token_def<>("[^\\\"]*") // action: ignore + ctx.more()
            | lex::token_def<>("\\\"\\\"") // how to set token value to '"' ?
            | lex::token_def<>("\\\"") [ lex::_state = "INITIAL" ]
            ;
    }
    lex::token_def<> string_quote_double, ...;
};
So how do I set the token's value to " when "" has been found?
Apart from that, I also have the following question: I can write a functor for a semantic action that calls ctx.more() and ignores the token at the same time (thus combining "low level" tokens into a "high level" string token). But how do I elegantly combine this with lex::_state = ".."?
Recommended answer
EDITED in response to a comment; see "UPDATE" below.
I suggest not trying to solve that in the lexer. Let the lexer yield raw strings:
template <typename Lexer>
struct mylexer_t : lex::lexer<Lexer>
{
    mylexer_t()
    {
        string_quote_double = "\\\"([^\"]|\\\"\\\")*\\\"";
        this->self("INITIAL")
            = string_quote_double
            | lex::token_def<>("[ \t\r\n]") [ lex::_pass = lex::pass_flags::pass_ignore ]
            ;
    }
    lex::token_def<std::string> string_quote_double;
};
NOTE: Exposing a token attribute like that requires a modified token typedef:
typedef lex::lexertl::token<char const*, boost::mpl::vector<char, std::string> > token_type;
typedef lex::lexertl::actor_lexer<token_type> lexer_type;
Do the postprocessing in the parser:
template <typename Iterator> struct mygrammar_t
    : public qi::grammar<Iterator, std::vector<std::string>()>
{
    typedef mygrammar_t<Iterator> This;

    template <typename TokenDef>
    mygrammar_t(TokenDef const& tok) : mygrammar_t::base_type(start)
    {
        using namespace qi;
        string_quote_double %= tok.string_quote_double [ undoublequote ];
        start = *string_quote_double;
        BOOST_SPIRIT_DEBUG_NODES((start)(string_quote_double));
    }

  private:
    qi::rule<Iterator, std::vector<std::string>()> start;
    qi::rule<Iterator, std::string()> string_quote_double;
};
As you can see, undoublequote can be any Phoenix actor that satisfies the criteria for a Spirit semantic action. A brain-dead example implementation would be:
static bool undoublequote(std::string& val)
{
    auto outidx = 0;
    for(auto in = val.begin(); in!=val.end(); ++in) {
        switch(*in) {
            case '"':
                if (++in == val.end()) { // eat the escape
                    // end of input reached
                    val.resize(outidx); // resize to effective chars
                    return true;
                }
                // fall through
            default:
                val[outidx++] = *in; // append the character
        }
    }
    return false; // not ended with double quote as expected
}
But I suggest you write a "proper" de-escaper (as I'm pretty sure MySQL will allow \t, \r, \u001e or even more archaic stuff as well).
I have some more complete samples in old answers here:
- TODO
- LINKS
- Here's a search page with many related answers using Spirit
UPDATE
In fact, as you indicated, it is fairly easy to integrate the attribute value normalization into the lexer itself:
template <typename Lexer>
struct mylexer_t : lex::lexer<Lexer>
{
    struct undoublequote_lex_type {
        template <typename, typename, typename, typename> struct result { typedef void type; };

        template <typename It, typename IdType, typename pass_flag, typename Ctx>
        void operator()(It& f, It& l, pass_flag& pass, IdType& id, Ctx& ctx) const {
            std::string raw(f,l);
            if (undoublequote(raw))
                ctx.set_value(raw);
            else
                pass = lex::pass_flags::pass_fail;
        }
    } undoublequote_lex;

    mylexer_t()
    {
        string_quote_double = "\\\"([^\"]|\\\"\\\")*\\\"";
        const static undoublequote_lex_type undoublequote_lex;
        this->self("INITIAL")
            = string_quote_double [ undoublequote_lex ]
            | lex::token_def<>("[ \t\r\n]") [ lex::_pass = lex::pass_flags::pass_ignore ]
            ;
    }
    lex::token_def<std::string> string_quote_double;
};
This reuses the same undoublequote function shown above, but wraps it in a Deferred Callable Object (or "polymorphic functor"), undoublequote_lex_type, that satisfies the criteria for a Lexer Semantic Action.
Here is a fully working proof of concept:
//#include <boost/config/warning_disable.hpp>
//#define BOOST_SPIRIT_DEBUG_PRINT_SOME 80
//#define BOOST_SPIRIT_DEBUG // before including Spirit
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>
#include <fstream>
#include <iostream> // std::cout, std::cerr
#include <string>
#include <vector>
#ifdef MEMORY_MAPPED
# include <boost/iostreams/device/mapped_file.hpp>
#endif
//#include <boost/spirit/include/lex_generate_static_lexertl.hpp>

namespace /*anon*/
{
    namespace phx=boost::phoenix;
    namespace qi =boost::spirit::qi;
    namespace lex=boost::spirit::lex;

    // Lexer: yields each double-quoted string as one raw token and skips whitespace
    template <typename Lexer>
    struct mylexer_t : lex::lexer<Lexer>
    {
        mylexer_t()
        {
            string_quote_double = "\\\"([^\"]|\\\"\\\")*\\\"";
            this->self("INITIAL")
                = string_quote_double
                | lex::token_def<>("[ \t\r\n]") [ lex::_pass = lex::pass_flags::pass_ignore ]
                ;
        }
        lex::token_def<std::string> string_quote_double;
    };

    // Brain-dead de-escaper: rewrites val in place
    static bool undoublequote(std::string& val)
    {
        auto outidx = 0;
        for(auto in = val.begin(); in!=val.end(); ++in) {
            switch(*in) {
                case '"':
                    if (++in == val.end()) { // eat the escape
                        // end of input reached
                        val.resize(outidx); // resize to effective chars
                        return true;
                    }
                    // fall through
                default:
                    val[outidx++] = *in; // append the character
            }
        }
        return false; // not ended with double quote as expected
    }

    // Grammar: a sequence of string tokens, each postprocessed by undoublequote
    template <typename Iterator> struct mygrammar_t
        : public qi::grammar<Iterator, std::vector<std::string>()>
    {
        typedef mygrammar_t<Iterator> This;

        template <typename TokenDef>
        mygrammar_t(TokenDef const& tok) : mygrammar_t::base_type(start)
        {
            using namespace qi;
            string_quote_double %= tok.string_quote_double [ undoublequote ];
            start = *string_quote_double;
            BOOST_SPIRIT_DEBUG_NODES((start)(string_quote_double));
        }

      private:
        qi::rule<Iterator, std::vector<std::string>()> start;
        qi::rule<Iterator, std::string()> string_quote_double;
    };
}

std::vector<std::string> do_test_parse(const std::string& v)
{
    char const *first = &v[0];
    char const *last = first+v.size();

    typedef lex::lexertl::token<char const*, boost::mpl::vector<char, std::string> > token_type;
    typedef lex::lexertl::actor_lexer<token_type> lexer_type;
    typedef mylexer_t<lexer_type>::iterator_type iterator_type;

    const static mylexer_t<lexer_type> mylexer;
    const static mygrammar_t<iterator_type> parser(mylexer);

    auto iter = mylexer.begin(first, last);
    auto end = mylexer.end();

    std::vector<std::string> data;
    bool r = qi::parse(iter, end, parser, data);
    r = r && (iter == end); // require that all tokens were consumed
    if (!r)
        std::cerr << "parsing (" << iter->state() << ") failed at: '" << std::string(first, last) << "'\n";
    return data;
}

int main(int argc, const char *argv[])
{
    for (auto&& s : do_test_parse( "\"bla\"\"blo\""))
        std::cout << s << std::endl;
}