Non-ASCII character declaration

Question

I would like to store a character (in order to compare it with other characters).

If I declare the variable like this:

char c = 'é';

everything works well, but I get these warnings:

warning: multi-character character constant [-Wmultichar]
   char c = 'é';
            ^
ii.c:12:3: warning: overflow in implicit constant conversion [-Woverflow]
   char c = 'é';

I think I understand why I get these warnings, but I wonder why it still works. Should I define it like this instead: int d = 'é'; even though it takes more space in memory? Moreover, I also get the warning below with this declaration:

warning: multi-character character constant [-Wmultichar]

int d = 'é';

Am I missing something? Thanks ;)

Recommended answer

é has the Unicode code point 0xE9 (U+00E9); its UTF-8 encoding is "\xc3\xa9".

I assume your source file is encoded in UTF-8, so

char c = 'é';

is (roughly) equivalent to

char c = '\xc3\xa9';

How such a character constant is treated is implementation-defined. For GCC:

The compiler evaluates a multi-character character constant a character at a time, shifting the previous value left by the number of bits per target character, and then or-ing in the bit-pattern of the new character truncated to the width of a target character. The final bit-pattern is given type int, and is therefore signed, regardless of whether single characters are signed or not (a slight change from versions 3.1 and earlier of GCC). If there are more characters in the constant than would fit in the target int the compiler issues a warning, and the excess leading characters are ignored.

For example, 'ab' for a target with an 8-bit char would be interpreted as (int) ((unsigned char) 'a' * 256 + (unsigned char) 'b'), and '\234a' as (int) ((unsigned char) '\234' * 256 + (unsigned char) 'a').

Hence, 'é' has the value 0xC3A9, which fits into an int (at least for 32-bit int), but not into an (8-bit) char, so the conversion to char is again implementation-defined:

For conversion to a type of width N, the value is reduced modulo 2^N to be within range of the type; no signal is raised.

This gives (with a signed char):

#include <stdio.h>
int main(void) {
    printf("%d %d\n", 'é', (char)'é');
    if((char)'é' == (char)'©') puts("(char)'é' == (char)'©'");
}

Output:

50089 -87
(char)'é' == (char)'©'

50089 is 0xC3A9, and -87 is 0xA9 read as a signed 8-bit value (169 - 256).

So you lose information when storing é in a char (there are characters, such as ©, that compare equal to é). You can:


  1. Use wchar_t, an implementation-dependent wide character type that is 4 bytes on Linux and holds UTF-32: wchar_t c = L'é';. You can convert wide characters to the locale-specific multibyte encoding with wcrtomb (probably UTF-8, but you need to set the locale first, see setlocale; note that changing the locale may change the behaviour of functions like isalpha or printf), or use them directly, and you can also use wide strings (use the L prefix to get wide string literals).
  2. Use a string and store UTF-8 in it, as in const char *c = "é"; or const char *c = "\u00e9"; or const char *c = "\xc3\xa9"; (with possibly different semantics; for C11, perhaps also look at UTF-8 string literals and the u8 prefix).

Note that file streams have an orientation (cf. fwide).

HTH
