Unicode encoding for string literals in C++11
Question
Following a related question, I'd like to ask about the new character and string literal types in C++11. It seems that we now have four sorts of characters and five sorts of string literals. The character types:
char a = '\x30'; // character, no semantics
wchar_t b = L'\xFFEF'; // wide character, no semantics
char16_t c = u'\u00F6'; // 16-bit, assumed UTF16?
char32_t d = U'\U0010FFFF'; // 32-bit, assumed UCS-4
And the string literals:
char A[] = "Hello\x0A"; // byte string, "narrow encoding"
wchar_t B[] = L"Hell\xF6\x0A"; // wide string, impl-def'd encoding
char16_t C[] = u"Hell\u00F6"; // (1)
char32_t D[] = U"Hell\U000000F6\U0010FFFF"; // (2)
auto E[] = u8"\u00F6\U0010FFFF"; // (3)
The question is this: Are the \x/\u/\U character references freely combinable with all string types? Are all the string types fixed-width, i.e. do the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes? Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP codepoint gets encoded into a two-unit UTF-16 sequence? And similarly for u8? In (1), can I write lone surrogates with \u? Finally, are any of the string functions encoding-aware (i.e. are they character-aware and can they detect invalid byte sequences)?
This is a bit of an open-ended question, but I'd like to get as complete a picture as possible of the new UTF-encoding and type facilities of the new C++11.
Are the \x/\u/\U character references freely combinable with all string types?
No. \x can be used in anything, but \u and \U can only be used in strings that are specifically UTF-encoded. However, for any UTF-encoded string, \u and \U can be used as you see fit.
Are all the string types fixed-width, i.e. do the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes?
Not in the way you mean. \x, \u, and \U are converted based on the string encoding. The number of resulting code units (using Unicode terminology; a char16_t is one UTF-16 code unit) depends on the encoding of the containing string. The literal u8"\u1024" would create a string containing 3 chars plus a null terminator, since U+1024 takes three UTF-8 code units. The literal u"\u1024" would create a string containing 1 char16_t plus a null terminator.
The number of code units used is based on the Unicode encoding.
Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP codepoint gets encoded into a two-unit UTF16 sequence?
u"" creates a UTF-16 encoded string. u8"" creates a UTF-8 encoded string. They will be encoded per the Unicode specification.
In (1), can I write lone surrogates with \u?
Absolutely not. The specification expressly forbids using the UTF-16 surrogate range (0xD800-0xDFFF) as codepoints for \u or \U.
Finally, are any of the string functions encoding aware (i.e. they are character-aware and can detect invalid byte sequences)?
Absolutely not. Well, allow me to rephrase that. std::basic_string doesn't deal with Unicode encodings. Its instantiations can certainly store UTF-encoded strings, but they can only think of them as sequences of char, char16_t, or char32_t; they can't think of them as a sequence of Unicode codepoints that are encoded with a particular mechanism. basic_string::length() will return the number of code units, not code points. And obviously, the C standard library string functions are totally useless.
It should be noted however that "length" for a Unicode string does not mean the number of codepoints. Some code points are combining "characters" (an unfortunate name), which combine with the previous codepoint. So multiple codepoints can map to a single visual character.
Iostreams can in fact read/write Unicode-encoded values. To do so, you will have to use a locale to specify the encoding and properly imbue it into the various places. This is easier said than done, and I don't have any code on me to show you how.