How to match unicode characters with boost::spirit?

Question

How can I match utf8 unicode characters using boost::spirit?

For example, I want to recognize all characters in this string:

$ echo "На берегу пустынных волн" | ./a.out
Н а б е р е г у п у с т ы н н ы х в о л н

When I try this simple boost::spirit program it will not match the unicode characters correctly:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <boost/foreach.hpp>
#include <iostream>
#include <vector>
namespace qi = boost::spirit::qi;

int main() {
  std::cin.unsetf(std::ios::skipws);
  boost::spirit::istream_iterator begin(std::cin);
  boost::spirit::istream_iterator end;

  std::vector<char> letters;
  bool result = qi::phrase_parse(
      begin, end,  // input     
      +qi::char_,  // match every character
      qi::space,   // skip whitespace 
      letters);    // result    

  BOOST_FOREACH(char letter, letters) {
    std::cout << letter << " ";
  }
  std::cout << std::endl;
}

It behaves like this:

$ echo "На берегу пустынных волн" | ./a.out | less
<D0> <9D> <D0> <B0> <D0> <B1> <D0> <B5> <D1> <80> <D0> <B5> <D0> <B3> <D1> <83> <D0> <BF> <D1> <83> <D1> <81> <D1> <82> <D1> <8B> <D0> <BD> <D0> <BD> <D1> <8B> <D1> <85> <D0> <B2> <D0> <BE> <D0> <BB> <D0> <BD>

Update:

Okay, I worked on this a bit more, and the following code is sort of working. It first converts the input into an iterator of 32-bit unicode characters (as recommended here):

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <boost/foreach.hpp>
#include <boost/regex/pending/unicode_iterator.hpp>
#include <iostream>
#include <string>
#include <vector>
namespace qi = boost::spirit::qi;

int main() {
  std::string str = "На берегу пустынных волн";
  boost::u8_to_u32_iterator<std::string::const_iterator>
      begin(str.begin()), end(str.end());
  typedef boost::uint32_t uchar; // a unicode code point
  std::vector<uchar> letters;
  bool result = qi::phrase_parse(
      begin, end,             // input
      +qi::standard_wide::char_,  // match every character
      qi::space,              // skip whitespace
      letters);               // result
  BOOST_FOREACH(uchar letter, letters) {
    std::cout << letter << " ";
  }
  std::cout << std::endl;
}

The code prints the Unicode code points:

$ ./a.out 
1053 1072 1073 1077 1088 1077 1075 1091 1087 1091 1089 1090 1099 1085 1085 1099 1093 1074 1086 1083 1085 

This seems to be correct, according to the official Unicode table.

Now, can anyone tell me how to print the actual characters instead, given this vector of Unicode code points?

Recommended answer

I haven't got much experience with it, but apparently Spirit (SVN trunk version) supports Unicode.

#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout

See, e.g. the sexpr parser sample which is in the scheme demo.

BOOST_ROOT/libs/spirit/example/scheme

I believe this is based on the demo from a presentation by Bryce Lelbach (see note 1), which specifically showcases:


  • wchar support
  • utree attributes (still experimental)
  • s-expressions

There is also an article about s-expressions and variant.

1. If that is indeed the case, here is the video from that presentation and the slides (pdf), as found here (odp).
