Are uppercase UTF-8 characters always the same number of bytes as their lowercase variants?
Question
Obviously this is true for the Latin alphabet, but I'm asking in a conceptual sense, across languages and the Unicode spec.
In practice, this came up when comparing two strings. If you already know they aren't the same number of bytes, can you treat that as enough of a guarantee, across all languages, that they are not differently "cased" versions of the same string?
Recommended answer
No.
Consider U+0069 "i", which is the single octet 69 in UTF-8. Its uppercase form under Turkish and Azerbaijani casing rules, U+0130 "İ", encodes as the two-octet UTF-8 sequence C4 B0.
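The byte counts above can be checked directly. The following is a minimal Python sketch, not part of the original answer: the helper names `utf8_len` and `same_ignoring_case` are made up for illustration, and it relies on Python's built-in locale-independent case mappings (`str.upper`, `str.lower`, `str.casefold`). It also shows two pairs that change length without any locale tailoring: U+0131 "ı" uppercases to "I", and U+212A (Kelvin sign) lowercases to "k".

```python
def utf8_len(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s (hypothetical helper)."""
    return len(s.encode("utf-8"))


def same_ignoring_case(a: str, b: str) -> bool:
    """Caseless comparison using Unicode case folding (hypothetical helper)."""
    return a.casefold() == b.casefold()


# The pair from the answer: "i" is one byte, "İ" (U+0130) is two.
print(utf8_len("i"), "\u0130".encode("utf-8").hex())

# Locale-independent mappings that also change the encoded length.
for ch, mapped in [("\u0131", "\u0131".upper()), ("\u212a", "\u212a".lower())]:
    print(f"U+{ord(ch):04X} ({utf8_len(ch)} bytes) -> {mapped!r} ({utf8_len(mapped)} bytes)")

# Different byte lengths, yet equal under case folding.
print(same_ignoring_case("\u212a", "K"), utf8_len("\u212a"), utf8_len("K"))
```

So a mismatch in byte length cannot by itself rule out that two strings differ only in case; comparing case-folded forms is the safer test.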
Obligatory note: case sensitivity.