Meaning of character literals containing trigraphs for non-representable characters


Problem description

On a C compiler which uses ASCII as its character set, the value of the character literal '??<' would be equivalent to that of '{', i.e. 0x7B. What would be the value of that literal on a compiler whose character set doesn't have a { character?

Outside a string literal, a compiler could infer that ??< is supposed to have the same meaning as an open-brace character is defined to have, even if the compiler's character set doesn't have an open-brace character. Indeed, the whole purpose of trigraphs is to allow sequences of representable characters to stand in for characters that aren't representable. The spec requires that trigraphs be processed even within string literals, however, which has me puzzled. If a compiler's character set includes a { character, the compiler can allow '{' to be written as '??<', but if the character set includes { I see no reason a programmer wouldn't simply use that. If the character set doesn't include {, however, which would seem to be the only reason for using trigraphs in the first place, what representable character would a compiler be expected to replace ??< with?

Solution

When it comes to considerations about the environment, especially to files, the C standard intentionally becomes rather vague. The following guarantees are made about trigraphs and the encoding of their corresponding characters:

C11 (n1570) 5.1.1.2 p1 ("Translation phases") [emph. mine]

  1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.

Thus, the trigraph sequence must be mapped to a single byte. This single-byte character must be a member of the basic character set, and it must be distinct from every other member of that set. How the compiler handles trigraphs internally during translation isn’t really observable behaviour, so it’s irrelevant.

If written to a text stream it may be converted (as I read it, maybe back to a trigraph sequence if the underlying encoding doesn’t have an encoding for a certain character). It can be read back again, and must compare equal if it is considered a printing character. Ibid. 7.21.2 p2:

[…] Data read in from a text stream will necessarily compare equal to the data that were earlier written out to that stream only if: the data consist only of printing characters and the control characters horizontal tab and new-line; no new-line character is immediately preceded by space characters; and the last character is a new-line character. […]

Ibid. 7.4 p3:

The term printing character refers to a member of a locale-specific set of characters, each of which occupies one printing position on a display device; the term control character refers to a member of a locale-specific set of characters that are not printing characters.*) All letters and digits are printing characters.

*) In an implementation that uses the seven-bit US ASCII character set, the printing characters are those whose values lie from 0x20 (space) through 0x7E (tilde); the control characters are those whose values lie from 0 (NUL) through 0x1F (US), and the character 0x7F (DEL).

And for binary streams, ibid. 7.21.2 p3:

A binary stream is an ordered sequence of characters that can transparently record internal data. Data read in from a binary stream shall compare equal to the data that were earlier written out to that stream, under the same implementation. Such a stream may, however, have an implementation-defined number of null characters appended to the end of the stream.
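The binary-stream guarantee is easy to try out. The following is a small sketch, assuming tmpfile() is usable in the environment; the function name binary_roundtrip is made up for illustration. Only the first character is read back, so any null characters the implementation appends at the end do not matter.

```c
#include <stdio.h>

/* Write one character to a temporary binary stream and read it back.
   Returns 1 if the character compares equal, 0 if it does not,
   and -1 if the temporary file could not be created or written. */
int binary_roundtrip(char c)
{
    FILE *fp = tmpfile();       /* tmpfile() opens the file in "wb+" mode */
    int back, ok;
    if (fp == NULL)
        return -1;
    if (fputc((unsigned char)c, fp) == EOF) {
        fclose(fp);
        return -1;
    }
    rewind(fp);                 /* reposition before switching to reading */
    back = fgetc(fp);
    ok = (back != EOF && (char)back == c);
    fclose(fp);
    return ok;
}
```

Calling binary_roundtrip('{') on a conforming implementation should never return 0, whichever byte value the implementation assigns to the brace.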

In the comments above, the question arose if

printf("int main(void) ??< ??>\n");     // (1) 
printf("int main(void) ?\?< ?\?>\n");   // (2)

always works for code generation, i.e. whether the output of that statement is guaranteed to be compilable. I couldn’t find a normative reference requiring isprint('??<') etc. (for (1)) or even isprint('<') etc. (for (2)) to return non-zero, but the C89 rationale about streams says:

The set of characters required to be preserved in text stream I/O are those needed for writing C programs; the intent is the Standard should permit a C translator to be written in a maximally portable fashion. Control characters such as backspace are not required for this purpose, so their handling in text streams is not mandated.

When '??<' etc. is written to a binary stream, it must map to a single byte, be printed as such, be unique and distinguishable from any other basic character, and compare equal to '??<' when read back.
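For reference, the difference between variants (1) and (2) is only in what reaches the output: in (1) a compiler that processes trigraphs replaces them before the string literal is formed, so the braces themselves are written, whereas the escaped form in (2) writes the trigraphs spelled out. A minimal sketch of (2), writing into a buffer instead of stdout so the result can be inspected; the function name emit_main_with_trigraphs is hypothetical.

```c
#include <stdio.h>

/* Format the program text of variant (2) into buf. The backslash
   between the question marks stops them from forming a trigraph, so
   the buffer receives the question marks and brackets as written.
   Returns the number of characters snprintf would have written. */
int emit_main_with_trigraphs(char *buf, size_t size)
{
    return snprintf(buf, size, "int main(void) ?\?< ?\?>\n");
}
```

The escaped spelling behaves the same whether or not the compiler has trigraph processing enabled, which makes it the more predictable of the two for generated code.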


Related: C89 rationale about trigraphs.
