How to compare a "basic_string" using an arbitrary locale


Question




I'm re-posting a question I submitted earlier today but I'm now citing a specific example in response to the feedback I received. The original question can be found here (note that it's not a homework assignment):

I'm simply trying to determine if C++ makes it impossible to perform an (efficient) case-INsensitive comparison of a basic_string object that also factors in any arbitrary locale object. For instance, it doesn't appear to be possible to write an efficient function such as the following:

bool AreStringsEqualIgnoreCase(const string &str1, const string &str2, const locale &loc);

Based on my current understanding (but can someone confirm this), this function has to call both ctype::toupper() and collate::compare() for the given locale (extracted as always using use_facet()). However, because collate::compare() in particular requires 4 pointer args, you either need to pass these 4 args for every char you need to compare (after first calling ctype::toupper()), or alternatively, convert both strings to uppercase first and then make a single call to collate::compare().
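The second approach can be sketched as follows (the function name is the asker's; this only illustrates the ctype and collate facet interfaces and, as the answer below explains, is not correct case folding):

```cpp
#include <locale>
#include <string>

// Sketch of "uppercase both strings, then one collate::compare() call".
// Illustrative only: per-code-unit toupper() is NOT correct case folding.
bool AreStringsEqualIgnoreCase(const std::string& str1,
                               const std::string& str2,
                               const std::locale& loc)
{
    const auto& ct  = std::use_facet<std::ctype<char>>(loc);
    const auto& col = std::use_facet<std::collate<char>>(loc);

    std::string u1(str1), u2(str2);
    if (!u1.empty()) ct.toupper(&u1[0], &u1[0] + u1.size());  // in place
    if (!u2.empty()) ct.toupper(&u2[0], &u2[0] + u2.size());

    return col.compare(u1.data(), u1.data() + u1.size(),
                       u2.data(), u2.data() + u2.size()) == 0;
}
```

Both strings are copied and upper-cased in full before the single collate::compare() call, which is exactly the allocation and copying overhead the question is asking about.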

The first approach is obviously inefficient (4 pointers to pass for each char tested), and the second requires you to convert both strings in their entirety (requiring allocation of memory and needless copying/converting of both strings to uppercase). Am I correct about this, i.e., that it's not possible to do it efficiently (because there's no way around collate::compare())?

Solution

One of the little annoyances about trying to deal in a consistent way with all the world's writing systems is that practically nothing you think you know about characters is actually correct. This makes it tricky to do things like "case-insensitive comparison". Indeed, it is tricky to do any form of locale-aware comparison, and case-insensitivity is additionally thorny.

With some constraints, though, it is possible to accomplish. The algorithm needed can be implemented "efficiently" using normal programming practices (and precomputation of some static data), but it cannot be implemented as efficiently as an incorrect algorithm. It is often possible to trade off correctness for speed, but the results are not pleasant. Incorrect but fast locale implementations may appeal to those whose locales are implemented correctly, but are clearly unsatisfactory for the part of the audience whose locales produce unexpected results.

Lexicographical ordering doesn't work for human beings

Most locales (other than the "C" locale) for languages which have case already handle letter case in the manner expected, which is to use case differences only after all other differences have been taken into account. That is, if a list of words is sorted in the locale's collation order, then words in the list which differ only in case are going to be consecutive. Whether the words with upper case come before or after words with lower case is locale-dependent, but there won't be other words in between.

That result cannot be achieved by any single-pass left-to-right character-by-character comparison ("lexicographical ordering"). And most locales have other collation quirks which also don't yield to naïve lexicographical ordering.

Standard C++ collation should be able to deal with all of these issues, if you have appropriate locale definitions. But it cannot be reduced to lexicographical comparison just using a comparison function over pairs of wchar_t, and consequently the C++ standard library doesn't provide that interface.
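What the standard library does offer is collate<charT>::compare, wrapped by std::locale's own operator(), so a locale can be used directly as a sort predicate. A minimal sketch, using the portable "C" locale since named locales (e.g. a Swedish one) may not be installed on every system:

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// std::locale is itself a comparison functor delegating to
// collate<char>::compare, so it can be passed straight to std::sort.
std::vector<std::string> SortWithLocale(std::vector<std::string> words,
                                        const std::locale& loc)
{
    std::sort(words.begin(), words.end(), loc);
    return words;
}
```

In the "C" locale this degenerates to plain byte-wise ordering; a correctly tailored locale applies the multi-pass rules described here, which is precisely what a per-wchar_t comparison function cannot express.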

The following are just a few examples of why locale-aware collation is complicated; a longer explanation, with a lot more examples, is found in Unicode Technical Standard 10.

Where do the accents go?

Most romance languages (and also English, when dealing with borrowed words) consider accents over vowels to be a secondary characteristic; that is, words are first sorted as though the accents weren't present, and then a second pass is made in which unaccented letters come before accented letters. A third pass is necessary to deal with case, which is ignored in the first two passes.

But that doesn't work for Northern European languages. The alphabets of Swedish, Norwegian and Danish have three extra vowels, which follow z in the alphabet. In Swedish, these vowels are written å, ä, and ö; in Norwegian and Danish, these letters are written å, æ, and ø, and in Danish å is sometimes written aa, making Aarhus the last entry in an alphabetical list of Danish cities.

In German, the letters ä, ö, and ü are generally alphabetised as with romance accents, but in German phonebooks (and sometimes other alphabetical lists), they are alphabetised as though they were written ae, oe and ue, which is the older style of writing the same phonemes. (There are many pairs of common surnames, such as "Müller" and "Mueller", which are pronounced the same and are often confused, so it makes sense to intercollate them. A similar convention was used for Scottish names in Canadian phonebooks when I was young; the spellings M', Mc and Mac were all clumped together since they are all phonetically identical.)
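The phonebook convention can be sketched as a key transform over UTF-32 code points (a toy table; the real tailoring lives in locale data such as CLDR's German phonebook collation, not in hand-written code like this):

```cpp
#include <string>

// Toy sketch: expand ä, ö, ü to ae, oe, ue before comparing, as German
// phonebooks do. Only these three mappings are handled here.
std::u32string PhonebookKey(const std::u32string& in)
{
    std::u32string out;
    for (char32_t c : in) {
        switch (c) {
            case U'\u00E4': out += U"ae"; break;  // ä
            case U'\u00F6': out += U"oe"; break;  // ö
            case U'\u00FC': out += U"ue"; break;  // ü
            default:        out += c;     break;  // pass through
        }
    }
    return out;
}
```

With this key, "Müller" and "Mueller" produce the same byte sequence and therefore collate together.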

One symbol, two letters. Or two letters, one symbol

German also has the symbol ß which is collated as though it were written out as ss, although it is not quite identical phonetically. We'll meet this interesting symbol again a bit later.

In fact, many languages consider digraphs and even trigraphs to be single letters. The 44-letter Hungarian alphabet includes Cs, Dz, Dzs, Gy, Ly, Ny, Sz, Ty, and Zs, as well as a variety of accented vowels. However, the language most commonly referenced in articles about this phenomenon -- Spanish -- stopped treating the digraphs ch and ll as letters in 1994, presumably because it was easier to force Hispanic writers to conform to computer systems than to change the computer systems to deal with Spanish digraphs. (Wikipedia claims it was pressure from "UNESCO and other international organizations"; it took quite a while for everyone to accept the new alphabetization rules, and you still occasionally find "Chile" after "Colombia" in alphabetical lists of South American countries.)

Summary: comparing character strings requires multiple passes, and sometimes requires comparing groups of characters

Making it all case-insensitive

Since locales handle case correctly in comparison, it should not really be necessary to do case-insensitive ordering. It might be useful to do case-insensitive equivalence-class checking ("equality" testing), although that raises the question of what other imprecise equivalence classes might be useful. Unicode normalization, accent deletion, and even transcription to latin are all reasonable in some contexts, and highly annoying in others. But it turns out that case conversions are not as simple as you might think, either.

Because of the existence of di- and trigraphs, some of which have Unicode codepoints, the Unicode standard actually recognizes three cases, not two: lower-case, upper-case and title-case. The last is what you use to upper case the first letter of a word, and it's needed, for example, for the Croatian digraph dž (U+01C6; a single character), whose uppercase is DŽ (U+01C4) and whose title case is Dž (U+01C5). The theory of "case-insensitive" comparison is that we could transform (at least conceptually) any string in such a way that all members of the equivalence class defined by "ignoring case" are transformed to the same byte sequence. Traditionally this is done by "upper-casing" the string, but it turns out that that is not always possible or even correct; the Unicode standard prefers the use of the term "case-folding", as do I.
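As a data-level sketch, the three single-code-point forms of the Croatian digraph can be mapped like this (code-point values are from the paragraph above; this is not a general title-casing API):

```cpp
// Toy three-way case mapping for the Croatian digraph, which exists as
// three distinct single code points in Unicode.
char32_t TitleCaseDz(char32_t c)
{
    switch (c) {
        case U'\u01C4':        // DŽ, upper case
        case U'\u01C6':        // dž, lower case
            return U'\u01C5';  // Dž, title case
        default:
            return c;          // everything else untouched in this toy
    }
}
```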

C++ locales aren't quite up to the job

So, getting back to C++, the sad truth is that C++ locales do not have sufficient information to do accurate case-folding, because C++ locales work on the assumption that case-folding a string consists of nothing more than sequentially and individually upper-casing each codepoint in the string using a function which maps a codepoint to another codepoint. As we'll see, that just doesn't work, and consequently the question of its efficiency is irrelevant. On the other hand, the ICU library has an interface which does case-folding as correctly as the Unicode database allows, and its implementation has been crafted by some pretty good coders so it is probably just about as efficient as possible within the constraints. So I'd definitely recommend using it.

If you want a good overview of the difficulty of case-folding, you should read sections 5.18 and 5.19 of the Unicode standard (PDF for chapter 5). The following are just a few examples.

A case transform is not a mapping from single character to single character

The simplest example is the German ß (U+00DF), which has no upper-case form because it never appears at the beginning of a word, and traditional German orthography didn't use all-caps. The standard upper-case transform is SS (or in some cases SZ), but that transform is not reversible; not all instances of ss are written as ß. Compare, for example, grüßen and küssen (to greet and to kiss, respectively). In v5.1, ẞ, an "upper-case ß", was added to Unicode as U+1E9E, but it is not commonly used except in all-caps street signs, where its use is legally mandated. The normal expectation when upper-casing ß is the two letters SS.
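The one-to-many expansion can be sketched like this (a toy handling only ASCII letters plus ß; everything else passes through):

```cpp
#include <string>

// Sketch: upper-casing ß (U+00DF) yields the two letters SS, so the
// output can be longer than the input. Only a-z and ß are mapped here.
std::u32string UpperWithEszett(const std::u32string& in)
{
    std::u32string out;
    for (char32_t c : in) {
        if (c == U'\u00DF')
            out += U"SS";                                   // ß → SS
        else if (c >= U'a' && c <= U'z')
            out += static_cast<char32_t>(c - (U'a' - U'A'));
        else
            out += c;
    }
    return out;
}
```

Note that the mapping is not reversible: given the result, there is no way to tell which SS came from ß and which from ss, which is exactly why küssen and grüßen cannot be distinguished after the transform.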

Not all ideographs (visible characters) are single character codes

Even when a case transform maps a single character to a single character, it may not be able to express that as a wchar→wchar mapping. For example, ǰ can easily be capitalized to J̌, but the former is a single combined glyph (U+01F0), while the second is a capital J with a combining caron (U+030C).

There is a further problem with glyphs like ǰ:

Naive character by character case-folding can denormalize

Suppose we upper-case ǰ as above. How do we capitalize ǰ̠ (which, in case it doesn't render properly on your system, is the same character with a bar underneath, another IPA convention)? That combination is U+01F0,U+0320 (j with caron, combining minus sign below), so we proceed to replace U+01F0 with U+004A,U+030C and then leave the U+0320 as is: J̠̌. That's fine, but it won't compare equal to a normalized capital J with caron and minus sign below, because in the normal form the minus sign diacritic comes first: U+004A,U+0320,U+030C (J̠̌, which should look identical). So sometimes (rarely, to be honest, but sometimes) it is necessary to renormalize.
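The denormalization can be demonstrated directly on the code-point sequences (values from the paragraph above):

```cpp
#include <string>

// Per-code-point upper-casing of the sequence U+01F0, U+0320: replace
// U+01F0 with its decomposed upper-case J + combining caron and leave
// U+0320 where it was. The result is not in canonical order.
std::u32string NaiveUpper(const std::u32string& in)
{
    std::u32string out;
    for (char32_t c : in) {
        if (c == U'\u01F0') { out += U'\u004A'; out += U'\u030C'; }
        else                  out += c;
    }
    return out;
}
```

The naive result U+004A,U+030C,U+0320 is canonically equivalent to the normalized U+004A,U+0320,U+030C but not byte-equal to it, so a bytewise comparison of the two would wrongly report them as different.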

Leaving aside Unicode weirdness, sometimes case-conversion is context-sensitive

Greek has a lot of examples of how marks get shuffled around depending on whether they are word-initial, word-final or word-interior -- you can read more about this in chapter 7 of the Unicode standard -- but a simple and common case is Σ, which has two lower-case versions: σ and ς. Non-Greeks with some maths background are probably familiar with σ, but might not be aware that it cannot be used at the end of a word, where you must use ς.
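A toy sketch of the context dependence (only an end-of-string/space rule is modeled here; real case mapping, e.g. in ICU, uses much richer context):

```cpp
#include <cstddef>
#include <string>

// Lower-case Greek capital sigma (Σ, U+03A3), choosing final sigma
// (ς, U+03C2) at the end of a word and medial sigma (σ, U+03C3)
// elsewhere. Note the lookahead: this cannot be written as a simple
// codepoint-to-codepoint mapping function.
std::u32string LowerSigma(const std::u32string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == U'\u03A3') {
            bool wordFinal = (i + 1 == in.size()) || in[i + 1] == U' ';
            out += wordFinal ? U'\u03C2' : U'\u03C3';
        } else {
            out += in[i];  // no other mappings in this toy
        }
    }
    return out;
}
```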

In short

  1. The best available correct way to case-fold is to apply the Unicode case-folding algorithm, which requires creating a temporary string for each source string. You could then do a simple bytewise comparison between the two transformed strings in order to verify that the original strings were in the same equivalence class. Doing a collation ordering on the transformed strings, while possible, is rather less efficient than collation ordering the original strings, and for sorting purposes, the untransformed comparison is probably as good or better than the transformed comparison.

  2. In theory, if you are only interested in case-folded equality, you could do the transformations linearly, bearing in mind that the transformation is not necessarily context-free and is not a simple character-to-character mapping function. Unfortunately, C++ locales don't provide you the data you need to do this. The Unicode CLDR comes much closer, but it's a complex data structure.

  3. All of this stuff is really complicated, and rife with edge cases. (See the note in the Unicode standard about accented Lithuanian i's, for example.) You're really better off just using a well-maintained existing solution, of which the best example is ICU.
