Handling utf-8 in Boost.Spirit with utf-32 parser

Problem description

My issue is similar to "How to use boost::spirit to parse UTF-8?" and "How to match unicode characters with boost::spirit?", but neither of those solves the problem I'm facing. I have a std::string containing UTF-8 characters. I wrapped the std::string with u8_to_u32_iterator and used unicode terminals like this:

BOOST_NETWORK_INLINE void parse_headers(std::string const & input, std::vector<request_header_narrow> & container) {
    using namespace boost::spirit::qi;
    u8_to_u32_iterator<std::string::const_iterator> begin(input.begin()), end(input.end());
    std::vector<request_header_narrow_utf8_wrapper> wrapper_container;
    parse(
        begin, end,
        *(
            +(alnum|(punct-':'))
            >> lit(": ")
            >> +((unicode::alnum|space|punct) - '\r' - '\n')
            >> lit("\r\n")
        )
        >> lit("\r\n")
        , wrapper_container
        );
    BOOST_FOREACH(request_header_narrow_utf8_wrapper header_wrapper, wrapper_container)
    {
        request_header_narrow header;
        u32_to_u8_iterator<request_header_narrow_utf8_wrapper::string_type::iterator>
            name_begin(header_wrapper.name.begin()),
            name_end(header_wrapper.name.end()),
            value_begin(header_wrapper.value.begin()),
            value_end(header_wrapper.value.end());
        for(; name_begin != name_end; ++name_begin)
            header.name += *name_begin;
        for(; value_begin != value_end; ++value_begin)
            header.value += *value_begin;
        container.push_back(header);
    }
}
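
For readers unfamiliar with what the u8_to_u32_iterator adaptor does conceptually, here is a minimal hand-rolled sketch of UTF-8 to UTF-32 decoding using only the standard library. The function name decode_utf8 is hypothetical and not part of Boost, and the validation is deliberately minimal (overlong forms and surrogates are not rejected), so treat it as an illustration rather than a replacement for the Boost iterator.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Boost): decode a UTF-8 byte string into
// UTF-32 code points. Only lead/continuation bit patterns are checked;
// overlong encodings and surrogates are NOT rejected in this sketch.
std::u32string decode_utf8(std::string const& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t extra;  // number of continuation bytes that follow
        char32_t cp;        // code point being assembled
        if      (lead < 0x80)           { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        // The sequence must not run past the end of the input.
        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Calling decode_utf8 on the three bytes 0xE1 0xB9 0xA1 yields the single code point U+1E61, which is the kind of value the unicode terminals in the grammar then operate on.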

The request_header_narrow_utf8_wrapper is defined and mapped to Fusion like this (don't mind the missing namespace declarations):

struct request_header_narrow_utf8_wrapper
{
    typedef std::basic_string<boost::uint32_t> string_type;
    std::basic_string<boost::uint32_t> name, value;
};

BOOST_FUSION_ADAPT_STRUCT(
    boost::network::http::request_header_narrow_utf8_wrapper,
    (std::basic_string<boost::uint32_t>, name)
    (std::basic_string<boost::uint32_t>, value)
    )

This works fine, but I was wondering whether I can somehow make the parser assign directly to a struct containing std::string members, instead of doing the for-each loop with the u32_to_u8_iterator. One way might be to write a wrapper for std::string that has an assignment operator taking boost::uint32_t, so that the parser could assign directly, but are there other solutions?

EDIT

After reading some more, I ended up with this:

namespace boost { namespace spirit { namespace traits {

    typedef std::basic_string<uint32_t> u32_string;

   /* template <>
    struct is_string<u32_string> : mpl::true_ {};*/

    template <> // <typename Attrib, typename T, typename Enable>
    struct assign_to_container_from_value<std::string, u32_string, void>
    {
        static void call(u32_string const& val, std::string& attr) {
            u32_to_u8_iterator<u32_string::const_iterator> begin(val.begin()), end(val.end());
            for(; begin != end; ++begin)
                attr += *begin;
        }
    };

} // namespace traits

} // namespace spirit

} // namespace boost

and this:

BOOST_NETWORK_INLINE void parse_headers(std::string const & input, std::vector<request_header_narrow> & container) {
    using namespace boost::spirit::qi;
    u8_to_u32_iterator<std::string::const_iterator> begin(input.begin()), end(input.end());
    parse(
        begin, end,
        *(
            as<boost::spirit::traits::u32_string>()[+(alnum|(punct-':'))]
            >> lit(": ")
            >> as<boost::spirit::traits::u32_string>()[+((unicode::alnum|space|punct) - '\r' - '\n')]
            >> lit("\r\n")
        )
        >> lit("\r\n")
        , container
        );
}

Any comments or advice? Is this the best I can get?

Solution

Another job for an attribute trait. I've simplified your datatypes for demonstration purposes:

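#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>          // semantic actions such as _val += _1
#include <boost/regex/pending/unicode_iterator.hpp>  // u8_to_u32_iterator, u32_to_u8_iterator
#include <string>

typedef std::basic_string<uint32_t> u32_string;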

struct Value 
{
    std::string value;
};

Now you can have the conversion happen "auto-magically" using:

namespace boost { namespace spirit { namespace traits {
    template <> // <typename Attrib, typename T, typename Enable>
        struct assign_to_attribute_from_value<Value, u32_string, void>
        {
            typedef u32_to_u8_iterator<u32_string::const_iterator> Conv;

            static void call(u32_string const& val, Value& attr) {
                attr.value.assign(Conv(val.begin()), Conv(val.end()));
            }
        };
}}}

Consider a sample parser that parses JSON-style strings in UTF-8, while also allowing Unicode escape sequences of the form \uXXXX (four hex digits, stored as 32-bit code points). It is convenient to have the intermediate storage be a u32_string for this purpose:

///////////////////////////////////////////////////////////////
// Parser
///////////////////////////////////////////////////////////////

namespace qi         = boost::spirit::qi;
namespace encoding   = qi::standard_wide;
//namespace encoding = qi::unicode;

template <typename It, typename Skipper = encoding::space_type>
    struct parser : qi::grammar<It, Value(), Skipper>
{
    parser() : parser::base_type(start)
    {
        string = qi::lexeme [ L'"' >> *char_ >> L'"' ];

        static qi::uint_parser<uint32_t, 16, 4, 4> _4HEXDIG;

        char_ = +(~encoding::char_(L"\"\\")) [ qi::_val += qi::_1 ] |
                qi::lit(L"\x5C") >> (                        // \ (reverse solidus)
                    qi::lit(L"\x22") [ qi::_val += L'"'  ] | // "    quotation mark  U+0022
                    qi::lit(L"\x5C") [ qi::_val += L'\\' ] | // \    reverse solidus U+005C
                    qi::lit(L"\x2F") [ qi::_val += L'/'  ] | // /    solidus         U+002F
                    qi::lit(L"\x62") [ qi::_val += L'\b' ] | // b    backspace       U+0008
                    qi::lit(L"\x66") [ qi::_val += L'\f' ] | // f    form feed       U+000C
                    qi::lit(L"\x6E") [ qi::_val += L'\n' ] | // n    line feed       U+000A
                    qi::lit(L"\x72") [ qi::_val += L'\r' ] | // r    carriage return U+000D
                    qi::lit(L"\x74") [ qi::_val += L'\t' ] | // t    tab             U+0009
                    qi::lit(L"\x75")                         // uXXXX                U+XXXX
                        >> _4HEXDIG [ qi::_val += qi::_1 ]
                );

        // entry point
        start = string;
    }

    private:
    qi::rule<It, Value(),  Skipper> start;
    qi::rule<It, u32_string()> string;
    qi::rule<It, u32_string()> char_;
};

As you can see, the start rule simply assigns the attribute value to the Value struct, which implicitly invokes our assign_to_attribute_from_value trait!
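
To make the \uXXXX handling in the grammar more concrete, here is a small standard-library-only sketch of reading exactly four hex digits into a code point, which is the job the grammar delegates to qi::uint_parser<uint32_t, 16, 4, 4>. The function name parse_4hexdig and its signature are hypothetical, and accepting both upper- and lower-case digits is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Spirit): read exactly four hex digits
// from `s` starting at `pos`. On success, stores the value in `cp` and
// returns true; on failure, returns false and leaves `cp` untouched.
bool parse_4hexdig(std::string const& s, std::size_t pos, char32_t& cp) {
    // Four characters must be available from pos onwards.
    if (s.size() < 4 || pos > s.size() - 4) return false;

    char32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        char c = s[pos + i];
        unsigned digit;
        if      (c >= '0' && c <= '9') digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return false;           // not a hex digit
        value = value * 16 + digit;  // shift in one hex digit
    }
    cp = value;
    return true;
}
```

Given the text 1e61 (the digits following \u in the test input used in this answer), the helper produces the code point U+1E61, which would then be appended to the intermediate u32_string.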

A simple test program (Live on Coliru) to prove that it does work:

#include <iostream>
#include <iterator> // std::distance

// input assumed to be utf8
Value parse(std::string const& input) {
    auto first(begin(input)), last(end(input));

    typedef boost::u8_to_u32_iterator<decltype(first)> Conv2Utf32;
    Conv2Utf32 f(first), saved = f, l(last);

    static const parser<Conv2Utf32, encoding::space_type> p;

    Value parsed;
    if (!qi::phrase_parse(f, l, p, encoding::space, parsed))
    {
        std::cerr << "whoops at position #" << std::distance(saved, f) << "\n";
    }

    return parsed;
}

int main()
{
    Value parsed = parse("\"Footnote: ¹ serious busineş\\u1e61\n\"");
    std::cout << parsed.value;
}

Now observe that the output is encoded in UTF-8 again:

$ ./test | tee >(file -) >(xxd)

Footnote: ¹ serious busineşṡ
/dev/stdin: UTF-8 Unicode text
0000000: 466f 6f74 6e6f 7465 3a20 c2b9 2073 6572  Footnote: .. ser
0000010: 696f 7573 2062 7573 696e 65c5 9fe1 b9a1  ious busine.....
0000020: 0a        

The U+1E61 code-point has been correctly encoded as [0xE1,0xB9,0xA1].
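
The byte sequence can be checked by hand with a small standard-library-only encoder. This is a sketch of the general UTF-8 encoding rule, not the implementation behind u32_to_u8_iterator; the names append_utf8 and encode_utf8 are hypothetical.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Boost): append the UTF-8 encoding of a
// single Unicode code point to `out`. Surrogates are not rejected here.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx + 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::out_of_range("code point beyond U+10FFFF");
    }
}

// Encode a whole UTF-32 string as UTF-8.
std::string encode_utf8(std::u32string const& in) {
    std::string out;
    for (char32_t cp : in) append_utf8(out, cp);
    return out;
}
```

For U+1E61 the three-byte branch applies: the top four bits give 0xE1, the middle six bits give 0xB9, and the low six bits give 0xA1, matching the tail of the xxd dump. The same rule maps U+00B9 to 0xC2 0xB9 and U+015F to 0xC5 0x9F, which also appear in the dump.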
