char16_t and char32_t endianness
Problem description
C11 adds support for the portable wide character types char16_t and char32_t, intended for UTF-16 and UTF-32 respectively.
However, in the technical report, there is no mention of endianness for these two types.
For example, the following snippet, compiled with gcc-4.8.4 and -std=c11 on my x86_64 computer:
#include <stdio.h>
#include <uchar.h>

int main(void)
{
    char16_t utf16_str[] = u"十六"; // U+5341 U+516D
    // View the two code units as raw bytes, in the order they sit in memory.
    unsigned char *chars = (unsigned char *) utf16_str;
    printf("Bytes: %X %X %X %X\n", chars[0], chars[1], chars[2], chars[3]);
}
will produce
Bytes: 41 53 6D 51
which means that the code units are stored in little-endian order.
But is this behaviour platform/implementation dependent? Does it always follow the platform's endianness, or may an implementation choose to always store char16_t and char32_t in big-endian order?
char16_t and char32_t do not guarantee Unicode encoding. (That is a C++ feature.) The macros __STDC_UTF_16__ and __STDC_UTF_32__, respectively, indicate that Unicode code points actually determine the fixed-size character values. See C11 §6.10.8.2 for these macros.
(By the way, __STDC_ISO_10646__ indicates the same thing for wchar_t, and it also reveals which Unicode edition is implemented via wchar_t. Of course, in practice, the compiler simply copies code points from the source file to strings in the object file, so it doesn't need to know much about particular characters.)
Given that Unicode encoding is in effect, code point values stored in char16_t or char32_t must have the same object representation as uint_least16_t and uint_least32_t, because they are defined to be typedef aliases for those types, respectively (C11 §7.28). This is again somewhat in contrast to C++, which makes those types distinct but explicitly requires compatible object representation.
The upshot is that yes, there is nothing special about char16_t and char32_t. They are ordinary integers in the platform's endianness.
However, your test program has nothing to do with endianness. It simply uses the values of the wide characters without inspecting how they map to bytes in memory.