Why is there no definition for std::regex_traits<char32_t> (and thus no std::basic_regex<char32_t>) provided?


Problem description


I would like to use regular expressions on UTF-32 codepoints and found this reference stating that std::regex_traits has to be defined by the user before std::basic_regex can be used at all. There seem to be no changes planned for this in the future.

  1. Why is this even the case?

  2. Does this have to do with the fact that Unicode says combined codepoints have to be treated as equal to the single-codepoint representation (like the umlaut 'ä' represented as a single codepoint, or with the a and the dots as two separate ones)?

  3. Given the simplification that only single-codepoint characters would be supported, could this trait be defined easily or would this be either non-trivial nevertheless or require further limitations?

Solution

  1. Some aspects of regex matching are locale-aware, with the result that a std::regex_traits object includes or references an instance of a std::locale object. The C++ standard library only provides locales for char and wchar_t characters, so there is no standard locale for char32_t (unless it happens to be the same as wchar_t), and this restriction carries over into regexes.

  2. Your description is imprecise. Unicode defines a canonical equivalence relationship between two strings, which is based on normalizing the two strings, using either NFC or NFD, and then comparing the normalized values codepoint by codepoint. It does not define canonical equivalence simply as an equivalence between a codepoint and a codepoint sequence, because normalization cannot simply be done character-by-character. Normalisation may require reordering combining characters into the canonical order (after canonical (de)composition). As such, it does not easily fit into the C++ model of locale transformations, which are generally single-character.

    The C++ standard library does not implement any Unicode normalization algorithm; in C++, as in many other languages, the two strings L"\u00e4" (ä) and L"\u0061\u0308" (ä) will compare as different, although they are canonically equivalent and look to the human reader like the same grapheme. (On the machine I'm writing this answer on, the rendering of those two graphemes is subtly different; if you look closely, you'll see that the umlaut in the second one is slightly displaced from its visually optimal position. That violates the Unicode requirement that canonically equivalent strings have precisely the same rendering.)

    If you want to check for canonical equivalence of two strings, you need to use a Unicode normalisation library. Unfortunately, the C++ standard library does not include any such API; you could look at ICU (which also includes Unicode-aware regex matching).

    In any case, regular expression matching -- to the extent that it is specified in the C++ standard -- does not normalize the target string. This is permitted by the Unicode Technical Report on regular expressions, which recommends that the target string be explicitly normalized to some normalization form and the pattern written to work with strings normalized to that form:

    For most full-featured regular expression engines, it is quite difficult to match under canonical equivalence, which may involve reordering, splitting, or merging of characters.… In practice, regex APIs are not set up to match parts of characters or handle discontiguous selections. There are many other edge cases… It is feasible, however, to construct patterns that will match against NFD (or NFKD) text. That can be done by:

    • Putting the text to be matched into a defined normalization form (NFD or NFKD).
    • Having the user design the regular expression pattern to match against that defined normalization form. For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur.
    • Applying the matching algorithm on a code point by code point basis, as usual.

  3. The bulk of the work in creating a char32_t specialization of std::regex_traits would be creating a char32_t locale object. I've never tried doing either of these things; I suspect it would require a fair amount of attention to detail, because there are a lot of odd corner cases.
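As a rough illustration of point 2, the sketch below uses std::wregex (which the standard library does provide) rather than a char32_t regex. The helper names `precomposed`, `decomposed` and `contains` are made up for this example, and the decomposed string is hard-coded; in real code the NFD form would have to come from a normalization library such as ICU. It is only meant to show that plain comparison and regex matching work code point by code point, so a pattern written in NFD finds NFD text but not the precomposed form.

```cpp
#include <regex>
#include <string>

// Hypothetical sample data: the same visible word in two encodings.
// Precomposed: 'B', U+00E4 (a with diaeresis), 'r'.
inline std::wstring precomposed() { return L"B\u00e4r"; }
// Decomposed (NFD): 'B', 'a', U+0308 (combining diaeresis), 'r'.
inline std::wstring decomposed() { return L"Ba\u0308r"; }

// Hypothetical helper: true if `pattern` occurs somewhere in `text`.
// No normalization happens here; both arguments are assumed to be
// already in the same normalization form.
inline bool contains(const std::wstring& text, const std::wstring& pattern) {
    const std::wregex re(pattern);
    return std::regex_search(text, re);
}
```

With these definitions, `precomposed() != decomposed()` holds even though the two strings are canonically equivalent, and `contains(decomposed(), L"a\u0308")` succeeds while `contains(precomposed(), L"a\u0308")` does not. That is why the technical report suggests normalizing both the text and the pattern to one form before matching.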


The C++ standard is somewhat vague about the details of regular expression matching, leaving the details to external documentation about each flavour of regular expression (and without a full explanation about how to apply such external specifications to character types other than the one each flavour is specified on). However, it is possible to deduce that matching is character-by-character. For example, in § 28.3, Requirements [re.req], Table 136 includes the traits method responsible for the character-by-character equivalence algorithm:

Expression: v.translate(c)
Return type: X::char_type
Assertion: Returns a character such that for any character d that is to be considered equivalent to c then v.translate(c) == v.translate(d).

Similarly, in the description of regular expression matching for the default "Modified ECMAScript" flavour (§ 28.13, paragraph 14.1), the standard describes how the regular expression engine matches two characters (one in the pattern and one in the target):

During matching of a regular expression finite state machine against a sequence of characters, two characters c and d are compared using the following rules:

  1. if (flags() & regex_constants::icase) the two characters are equal if traits_inst.translate_nocase(c) == traits_inst.translate_nocase(d);

  2. otherwise, if flags() & regex_constants::collate the two characters are equal if traits_inst.translate(c) == traits_inst.translate(d);

  3. otherwise, the two characters are equal if c == d.
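The three rules quoted above can be sketched as a small function. This is not how any particular standard library implements its matcher; `has_flag` and `chars_equal` are hypothetical names, and the sketch only restates the quoted rules on top of the real std::regex_traits member functions translate and translate_nocase.

```cpp
#include <regex>

// Hypothetical helper: true if every bit of `f` is set in `flags`.
inline bool has_flag(std::regex_constants::syntax_option_type flags,
                     std::regex_constants::syntax_option_type f) {
    return (flags & f) == f;
}

// Sketch of the per-character comparison described in the standard:
// c comes from the pattern, d from the target sequence.
template <class Traits>
bool chars_equal(const Traits& traits,
                 std::regex_constants::syntax_option_type flags,
                 typename Traits::char_type c,
                 typename Traits::char_type d) {
    if (has_flag(flags, std::regex_constants::icase))
        return traits.translate_nocase(c) == traits.translate_nocase(d);  // rule 1
    if (has_flag(flags, std::regex_constants::collate))
        return traits.translate(c) == traits.translate(d);                // rule 2
    return c == d;                                                        // rule 3
}
```

For example, with a default-constructed `std::regex_traits<char>`, `chars_equal(traits, std::regex_constants::icase, 'A', 'a')` is true, while the same call with `std::regex_constants::ECMAScript` falls through to rule 3 and is false. Each rule only ever looks at one character from each side, which is why sequences such as a base letter plus a combining mark cannot be made equivalent to a single precomposed character through the traits class alone.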
