Range of UTF-8 Characters in C++11 Regex

Question

This question is an extension of "Do C++11 regular expressions work with UTF-8 strings?"

#include <iostream>
#include <regex>

if (std::regex_match("中", std::regex("中")))  // "\u4e2d" also works
  std::cout << "matched\n";

The program is compiled on Mac Mountain Lion with clang++ with the following options:

clang++ -std=c++0x -stdlib=libc++

The code above works. This is a standard range regex "[一-龠々〆ヵヶ]" for matching any Japanese Kanji or Chinese character. It works in JavaScript and Ruby, but I can't seem to get ranges working in C++11, even when using a similar version, [\u4E00-\u9fa0]. The code below does not match the string.

if (std::regex_match("中", std::regex("[一-龠々〆ヵヶ]")))
  std::cout << "range matched\n";

Changing locale hasn't helped either. Any ideas?

EDIT

So I have found that all ranges work if you add a + to the end, in this case [一-龠々〆ヵヶ]+, but if you add {1}, as in [一-龠々〆ヵヶ]{1}, it does not work. Moreover, it seems to overreach its boundaries. It won't match Latin characters, but it will match は, which is \u306f, and ぁ, which is \u3041. They both lie below \u4E00.

nhahtdh also suggested regex_search, which also works without adding +, but it still runs into the same problem as above by pulling in values outside of its range. I played with the locales a bit as well. Mark Ransom suggests it treats the UTF-8 string as a dumb set of bytes; I think this is possibly what it is doing.

Further supporting the theory that UTF-8 is getting jumbled somehow, [a-z]{1} and [a-z]+ both match a, but only [一-龠々〆ヵヶ]+ matches any of the characters, not [一-龠々〆ヵヶ]{1}.

Solution

Encoded in UTF-8, the string "[一-龠々〆ヵヶ]" is equal to this one: "[\xe4\xb8\x80-\xe9\xbe\xa0\xe3\x80\x85\xe3\x80\x86\xe3\x83\xb5\xe3\x83\xb6]". And this is not the droid character class you are looking for.
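To see that byte sequence for yourself, a small helper along these lines can print any UTF-8 string with its non-ASCII bytes escaped. This is only a sketch; the name hex_escape is made up for illustration and is not part of any library.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: keep ASCII bytes as they are and render every
// byte >= 0x80 as a \xNN escape, so the UTF-8 encoding becomes visible.
std::string hex_escape(const std::string& bytes) {
    std::string out;
    char buf[5];  // room for "\xNN" plus the terminating NUL
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            std::snprintf(buf, sizeof buf, "\\x%02x", static_cast<unsigned>(b));
            out += buf;
        }
    }
    return out;
}
```

Passing the pattern from the question through hex_escape and printing the result shows what the regex engine is actually handed: a bracket expression full of individual bytes.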

The character class you are looking for is the one that includes:

  • any character in the range U+4E00..U+9FA0; or
  • any of the characters 々, 〆, ヵ, ヶ.

The character class you specified is the one that includes:

  • any of the "characters" \xe4 or \xb8; or
  • any "character" in the range \x80..\xe9; or
  • any of the "characters" \xbe, \xa0, \xe3, \x80, \x85, \xe3 (again), \x80 (again), \x86, \xe3 (again), \x83, \xb5, \xe3 (again), \x83 (again), \xb6.

Messy, isn't it? Do you see the problem?

This will not match "latin" characters (which I assume you mean things like a-z) because in UTF-8 those all use a single byte below 0x80, and none of those is in that messy character class.

It will not match "中" either because "中" has three "characters", and your regex matches only one "character" out of that weird long list. Try assert(std::regex_match("中", std::regex("..."))) and you will see.

If you add a + it works because "中" has three of those "characters" in your weird long list, and now your regex matches one or more.

If you instead add {1} it does not match because we are back to matching three "characters" against one.

Incidentally, the pattern "中" matches the input "中" because we are matching the three "characters" against the same three "characters" in the same order.

The regex with + will actually match some undesired things because it does not care about order. Any character that can be made from that list of bytes in UTF-8 will match. It will match "\xe3\x81\x81" (ぁ U+3041) and it will even match invalid UTF-8 input like "\xe3\xe3\xe3\xe3".
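The byte-level behaviour described above can be modelled without std::regex at all. The sketch below assumes the engine treats each byte as one "character", as described here; the function names are invented for illustration. Note that every individually listed byte already falls inside the range \x80..\xe9, so the whole class collapses to that one range.

```cpp
#include <string>

// One byte is in the class if it lies in 0x80..0xe9. The individually
// listed bytes (0xe4, 0xb8, 0xbe, 0xa0, 0xe3, ...) are all inside that range.
bool in_byte_class(unsigned char b) {
    return b >= 0x80 && b <= 0xe9;
}

// Model of regex_match against "[...]": exactly one "character" (byte).
bool matches_one(const std::string& s) {
    return s.size() == 1 && in_byte_class(static_cast<unsigned char>(s[0]));
}

// Model of regex_match against "[...]+": one or more bytes, in any order.
bool matches_one_or_more(const std::string& s) {
    if (s.empty()) return false;
    for (unsigned char b : s) {
        if (!in_byte_class(b)) return false;
    }
    return true;
}
```

Under this model, matches_one fails for "中" (three bytes), while matches_one_or_more accepts "中", ぁ, and the invalid sequence "\xe3\xe3\xe3\xe3" alike, and rejects "a".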

The bigger problem is that you are using a regex library that does not even have level 1 support for Unicode, the bare minimum required. It munges bytes and there isn't much your precious tiny regex can do about it.

And the even bigger problem is that you are using a hardcoded set of characters to specify "any Japanese Kanji or Chinese character". Why not use the Unicode Script property for that?

R"(\p{Script=Han})"

Oh right, this won't work with C++11 regexes. For a moment there I almost forgot those are annoyingly worse than useless with Unicode.

So what should you do?

You could decode your input into a std::u32string and use char32_t all over for the matching. That would not give you this mess, but you would still be hardcoding ranges and exceptions when you mean "a set of characters that share a certain property".
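A rough sketch of that decoding approach follows. Since the standard only specifies std::regex_traits for char and wchar_t, this version checks the code points by hand rather than going through std::regex. The names decode_utf8, is_kanji and is_single_kanji are hypothetical, and the decoder is deliberately minimal: it rejects truncated sequences and bad continuation bytes but does not check for overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into code points (minimal validation only).
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xe0) == 0xc0) { len = 2; cp = lead & 0x1f; }
        else if ((lead & 0xf0) == 0xe0) { len = 3; cp = lead & 0x0f; }
        else if ((lead & 0xf8) == 0xf0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xc0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3f);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// The class the question intended: U+4E00..U+9FA0 plus
// U+3005, U+3006, U+30F5 and U+30F6 (the four extra characters).
bool is_kanji(char32_t cp) {
    return (cp >= 0x4e00 && cp <= 0x9fa0) ||
           cp == 0x3005 || cp == 0x3006 || cp == 0x30f5 || cp == 0x30f6;
}

// Counterpart of regex_match against the class: exactly one code point in it.
bool is_single_kanji(const std::string& utf8) {
    std::u32string cps = decode_utf8(utf8);
    return cps.size() == 1 && is_kanji(cps[0]);
}
```

With code points instead of bytes, "中" (U+4E2D) is accepted and ぁ (U+3041) is rejected, but the ranges and exceptions are still hardcoded, which is the limitation noted above.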

I recommend you forget about C++11 regexes and use some regular expression library that has the bare minimum level 1 Unicode support, like the one in ICU.
