Why is there no ASCII or UTF-8 character literal in C11 or C++11?


Question

Why is there no UTF-8 character literal in C11 or C++11 even though there are UTF-8 string literals? I understand that, generally speaking, a character literal represents a single ASCII character, which is identical to a single-octet UTF-8 code point, but neither C nor C++ says the encoding has to be ASCII.

Basically, if I read the standard right, there's no guarantee that '0' will represent the integer 0x30, yet u8"0" must represent the char sequence 0x30 0x00.

Edit:

I'm aware not every UTF-8 code point would fit in a char. Such a literal would only be useful for single-octet code points (aka, ASCII), so I guess calling it an "ASCII character literal" would be more fitting, so the question still stands. I just chose to frame the question with UTF-8 because there are UTF-8 string literals. The only way I can imagine portably guaranteeing ASCII values would be to write a constant for each character, which wouldn't be so bad considering there are only 128, but still...
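The "one constant per character" idea from the edit above could look roughly like the sketch below. The enum name, the constant names, and both helper functions are hypothetical and only cover a handful of the 128 codes; the values themselves are the standard ASCII codes.

```c
#include <assert.h>

/* Hypothetical named constants for a few ASCII codes. These are plain
   integers, so they do not depend on the execution character set. */
enum ascii_code {
    ASCII_SPACE   = 0x20,
    ASCII_0       = 0x30,
    ASCII_9       = 0x39,
    ASCII_UPPER_A = 0x41,
    ASCII_UPPER_Z = 0x5A,
    ASCII_LOWER_A = 0x61,
    ASCII_LOWER_Z = 0x7A
};

/* Returns nonzero if the byte is an ASCII upper-case letter. */
int is_ascii_upper_byte(unsigned char byte)
{
    return byte >= ASCII_UPPER_A && byte <= ASCII_UPPER_Z;
}

/* Maps the ASCII codes for 'A'..'Z' to 0..25. */
int ascii_upper_index(unsigned char byte)
{
    assert(is_ascii_upper_byte(byte)); /* precondition: upper-case ASCII */
    return byte - ASCII_UPPER_A;
}
```

Because the constants are integers rather than character literals, a protocol parser written this way compares against 0x41 even on a system where 'A' has a different value.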

Recommended answer

It is perfectly acceptable to write non-portable C code, and this is one of many good reasons to do so. Feel free to assume that your system uses ASCII or some superset thereof, and warn your users that they shouldn't try to run your program on an EBCDIC system.

If you are feeling very generous, you can encode a check. The gperf program is known to generate code that includes such a check.

_Static_assert('0' == 48, "must be ASCII-compatible");

Or, for pre-C11 compilers,

extern int must_be_ascii_compatible['0' == 48 ? 1 : -1];
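To show where such a check pays off, here is a minimal sketch that combines both forms of the guard with code that depends on it. The two helper functions are hypothetical, and the guard tests a few more characters than the one-line version because the helpers rely on the letters being contiguous.

```c
#include <assert.h>

/* Compile-time guard: refuse to build unless the execution character set
   agrees with ASCII for the characters this file relies on. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
_Static_assert('a' == 97 && 'z' == 122 && 'A' == 65,
               "must be ASCII-compatible");
#else
extern int must_be_ascii_compatible[('a' == 97 && 'z' == 122 && 'A' == 65) ? 1 : -1];
#endif

/* Hypothetical helper: upper-cases an ASCII letter and returns every other
   value unchanged. The range test and fixed offset are only correct because
   the guard above has ruled out character sets such as EBCDIC, where the
   letters are not contiguous. */
int ascii_to_upper(int c)
{
    if (c >= 'a' && c <= 'z')
        return c - ('a' - 'A');
    return c;
}

/* Hypothetical helper: maps 'a'..'z' to 0..25. */
int ascii_letter_index(int c)
{
    assert(c >= 'a' && c <= 'z'); /* precondition: lower-case ASCII letter */
    return c - 'a';
}
```

On a non-ASCII system the build stops at the guard instead of producing a program that silently mangles text.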

If you are on C11, you can use the u or U prefix on character constants, but not the u8 prefix...

/* This is useless, doesn't do what you want... */
_Static_assert(0, "this code is broken everywhere");
if (c == '々') ...

/* This works as long as wchar_t is UTF-16 or UTF-32 or UCS-2... */
/* Note: you shouldn't be using wchar_t, though... */
_Static_assert(__STDC_ISO_10646__, "wchar_t must be some form of Unicode");
if (c == L'々') ...

/* This works as long as char16_t is UTF-16 or UCS-2... */
_Static_assert(__STDC_UTF_16__, "char16_t must be UTF-16");
if (c == u'々') ...

/* This works as long as char32_t is UTF-32... */
_Static_assert(__STDC_UTF_32__, "char32_t must be UTF-32");
if (c == U'々') ...
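The last two fragments can be turned into compilable functions along these lines. This is only a sketch: the function names are hypothetical, the character is written as the universal character name \u3005 (the code point of 々) so the result does not depend on the source file's encoding, and the parameters use uint_least16_t and uint_least32_t, which C11 defines as the same types as char16_t and char32_t, so that <uchar.h> is not required.

```c
#include <assert.h>
#include <stdint.h>

/* Without these macros, u'...' and U'...' are not guaranteed to hold
   UTF-16 and UTF-32 values, so the comparisons below would be meaningless. */
#if !defined(__STDC_UTF_16__) || !defined(__STDC_UTF_32__)
#error "char16_t/char32_t constants are not guaranteed to be UTF-16/UTF-32"
#endif

/* Returns nonzero if the UTF-32 code unit is U+3005. */
int is_iteration_mark_utf32(uint_least32_t c)
{
    assert(c <= 0x10FFFF); /* precondition: inside the Unicode code space */
    return c == U'\u3005';
}

/* Returns nonzero if the UTF-16 code unit is U+3005. */
int is_iteration_mark_utf16(uint_least16_t c)
{
    return c == u'\u3005';
}
```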

There are some projects that are written in very portable C and have been ported to non-ASCII systems (example). This required a non-trivial amount of porting effort, and there's no real reason to make the effort unless you know you want to run your code on EBCDIC systems.

On standards: The people writing the C standard have to contend with every possible C implementation, including some downright bizarre ones. There are known systems where sizeof(char) == sizeof(long), CHAR_BIT != 8, integral types have trap representations, sizeof(void *) != sizeof(int *), sizeof(void *) != sizeof(void (*)()), va_list is heap-allocated, etc. It's a nightmare.

Don't beat yourself up trying to write code that will run on systems you've never even heard of, and don't search too hard for guarantees in the C standard.

For example, as far as the C standard is concerned, the following is a valid implementation of malloc:

void *malloc(size_t size) { return NULL; }

Note that while u8"..." constants are guaranteed to be UTF-8, u"..." and U"..." have no guarantees except that the encoding is 16 bits and 32 bits per character, respectively, and the actual encoding must be documented by the implementation.
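Since u8"..." is the one form with a guaranteed encoding, indexing a one-character u8 string is a possible stand-in for the missing character literal. The sketch below assumes that approach; both function names are hypothetical.

```c
#include <assert.h>

/* u8"0" is the two-byte array { 0x30, 0x00 } on every C11 implementation,
   so its first element is the ASCII code even if '0' is something else. */
unsigned char ascii_zero(void)
{
    return (unsigned char)u8"0"[0];
}

/* Converts 0..9 to the ASCII code of the matching digit. */
int to_ascii_digit(int value)
{
    assert(value >= 0 && value <= 9); /* precondition: single decimal digit */
    return ascii_zero() + value;
}
```

The drawback is that u8"0"[0] is not an integer constant expression in C, so it cannot appear in a case label or a static initializer, which is exactly where a real character literal would be most convenient.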

Summary: Safe to assume ASCII compatibility in 2012.
