char16_t and char32_t endianness

Question

C11 adds the portable wide character types char16_t and char32_t, intended for UTF-16 and UTF-32 respectively.

However, in the technical report, there is no mention of endianness for these two types.

For example, the following snippet, compiled with gcc-4.8.4 and -std=c11 on my x86_64 computer:

#include <stdio.h>
#include <uchar.h>

int main(void)
{
    char16_t utf16_str[] = u"十六";  // U+5341 U+516D
    /* View the array as raw bytes to inspect its in-memory layout. */
    unsigned char *chars = (unsigned char *) utf16_str;
    printf("Bytes: %X %X %X %X\n", chars[0], chars[1], chars[2], chars[3]);
    return 0;
}

will produce

Bytes: 41 53 6D 51

which means that the code units are stored in little-endian order.

But is this behaviour platform- or implementation-dependent? Does it always follow the platform's endianness, or may an implementation choose to always store char16_t and char32_t in big-endian order?

Solution

char16_t and char32_t do not guarantee Unicode encoding. (That is a C++ feature.) The macros __STDC_UTF_16__ and __STDC_UTF_32__, respectively, indicate that Unicode code points actually determine the fixed-size character values. See C11 §6.10.8.2 for these macros.
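
A program can test these macros before relying on UTF-16 or UTF-32 values. The following is only a minimal sketch: the macros come from the standard, but the helper names (char16_is_utf16, char32_is_utf32, describe_unicode_guarantees) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <uchar.h>

/* Hypothetical helper: 1 if the implementation promises that char16_t
   values are UTF-16 encoded (C11 6.10.8.2), 0 if the encoding is
   implementation-defined. */
int char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: same check for char32_t and UTF-32. */
int char32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: formats a one-line summary into buf and
   returns the snprintf result. */
int describe_unicode_guarantees(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "UTF-16: %s, UTF-32: %s",
                    char16_is_utf16() ? "guaranteed" : "not guaranteed",
                    char32_is_utf32() ? "guaranteed" : "not guaranteed");
}
```

Calling describe_unicode_guarantees with a 64-byte buffer is enough for either outcome; which one you get depends on the compiler and library in use.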

(By the way, __STDC_ISO_10646__ indicates the same thing for wchar_t, and it also reveals which Unicode edition is implemented via wchar_t. Of course, in practice, the compiler simply copies code points from the source file to strings in the object file, so it doesn't need to know much about particular characters.)

Given that Unicode encoding is in effect, code point values stored in char16_t or char32_t must have the same object representation as uint_least16_t and uint_least32_t, because they are defined to be typedef aliases to those types, respectively (C11 §7.28). This is again somewhat in contrast to C++, which makes those types distinct but explicitly requires compatible object representation.
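
The typedef relationship can be checked at compile time. This is a sketch that assumes a C11 compiler with _Generic and the static_assert macro from <assert.h>; the macro and function names are invented for the example. It must be compiled as C, since in C++ the types are distinct and the assertions would fail.

```c
#include <assert.h>
#include <stdint.h>
#include <uchar.h>

/* Because char16_t is a typedef for uint_least16_t in C11, _Generic
   selects the uint_least16_t branch for a char16_t operand. */
#define IS_UINT_LEAST16(x) _Generic((x), uint_least16_t: 1, default: 0)
#define IS_UINT_LEAST32(x) _Generic((x), uint_least32_t: 1, default: 0)

static_assert(IS_UINT_LEAST16((char16_t)0),
              "char16_t is not the same type as uint_least16_t");
static_assert(IS_UINT_LEAST32((char32_t)0),
              "char32_t is not the same type as uint_least32_t");

/* Hypothetical helpers exposing the same checks at run time. */
int char16_is_uint_least16(void)
{
    return IS_UINT_LEAST16((char16_t)0);
}

int char32_is_uint_least32(void)
{
    return IS_UINT_LEAST32((char32_t)0);
}
```

Same type implies same size, alignment, and object representation, which is the property the paragraph above relies on.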

The upshot is that yes, there is nothing special about char16_t and char32_t. They are ordinary integers in the platform's endianness.
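
Since the in-memory byte order follows the host, code that writes UTF-16 to a file or socket in a fixed byte order has to produce the bytes explicitly. A minimal sketch, assuming 8-bit bytes; host_is_little_endian and utf16_to_be_bytes are hypothetical names, not part of any library:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: returns 1 if the first byte in memory of a
   char16_t is its low-order byte (little-endian host), 0 otherwise. */
int host_is_little_endian(void)
{
    const char16_t probe = 0x0102;
    unsigned char first;
    memcpy(&first, &probe, 1);   /* copy only the first byte in memory */
    return first == 0x02;
}

/* Hypothetical helper: writes n UTF-16 code units as big-endian bytes
   into out, which must have room for 2 * n bytes. Shifts work on
   values rather than on memory, so the output does not depend on the
   host's byte order. */
void utf16_to_be_bytes(const char16_t *units, size_t n, unsigned char *out)
{
    assert(units != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = (unsigned char)((units[i] >> 8) & 0xFF);
        out[2 * i + 1] = (unsigned char)(units[i] & 0xFF);
    }
}
```

For the two code units U+5341 and U+516D from the question, utf16_to_be_bytes yields the bytes 53 41 51 6D on any host, whereas inspecting the array through an unsigned char pointer gives 41 53 6D 51 on a little-endian machine.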

However, your test program has nothing to do with endianness. It simply uses the values of the wide characters without inspecting how they map to bytes in memory.
