How does gcc decide the wide character set when calling `mbtowc()`?

Question

According to the gcc manual, the option -fwide-exec-charset specifies the wide character set of wide string and character constants at compile time.

But what is the wide character set when converting a multi-byte character to a wide character by calling mbtowc() at run time? The POSIX standard says that the character set of multi-byte characters is determined by the LC_CTYPE category of the current locale, but says nothing about the wide character set. I don't have a C standard at hand now so I don't know what the C standard says about this.

Does the gcc option -fwide-exec-charset determine the wide character set used by mbtowc(), just as it does at compile time?

Recommended answer

Short answer: the character set used for wide strings gets determined by the characteristics of wchar_t known at compile time. As mbtowc is a library function, this happens when libc is being built.

mbtowc reads a single character from a string encoded in an external charset and writes it out to a wchar_t value able to represent any character. Likewise, mbstowcs converts an externally encoded C string into a simple array of wchar_t. From the system's point of view, it doesn't make sense to specify the "charset" of the resulting wide character/string, because changing its output encoding in any way would break the usage of the resulting wide string as an array of wchar_t.

You can describe mbstowcs as producing fixed-width Unicode encodings such as UCS-2 or UCS-4 (or more precisely UTF-16 or UTF-32) if the wide chars correspond to ISO 10646 code points, and depending on the width of wchar_t. You can also describe it as little-endian or big-endian depending on the endianness of the processor's representation of wchar_t. But those are properties of the platform, which you can't change at run-time any more than you can change endianness, or ASCII to EBCDIC.

-fwide-exec-charset serves to explicitly specify to the compiler the charset that corresponds to the internal representation of an array of wchar_t. This is useful when it differs from the representation the compiler would normally generate (because you are cross-compiling, or because the compiler was misconfigured). This is why the manual goes on to warn that "you will have problems with encodings that do not fit exactly in wchar_t."
