Character sets - Not clear


Problem description


The standard defines

- basic source character set

- basic execution character set and its wide char counterpart

It also defines 'execution character set' and its wide char counterpart as follows

$2.2/3- "The execution character set and the execution wide-character set are supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific."

Q1. I don't think I understand this completely, particularly the last statement. Any pointers on this aspect?

Further,

$3.9.1 - "Objects declared as characters (char) shall be large enough to store any member of the implementation’s basic character set."

Q2. In 3.9.1, does the phrase 'basic character set' mean 'basic execution character set'?

Solution

You need to distinguish between the source character set, the execution character set, the wide execution character set and their basic versions:

The basic source character set:

§2.2/1: The basic source character set consists of 96 characters […]

This character set has exactly 96 characters. They fit into 7 bits. Characters like @ are not included.
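As a rough sketch of the point above, the 96 members can be written out and queried. The list follows the C++03 wording the answer quotes (space, four control characters, and 91 graphical characters); the names `basic_source_chars` and `in_basic_source` are hypothetical helpers, and later standard revisions may list a different set.

```cpp
#include <string>

// The 96 members of the basic source character set as listed in C++03
// (assumption: newer standards may differ).
const std::string& basic_source_chars() {
    static const std::string chars =
        " \t\v\f\n"                      // space and 4 control characters
        "abcdefghijklmnopqrstuvwxyz"     // 26 lowercase letters
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"     // 26 uppercase letters
        "0123456789"                     // 10 digits
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"; // 29 punctuation characters
    return chars;
}

// True if c is one of the 96 basic source characters.
bool in_basic_source(char c) {
    return basic_source_chars().find(c) != std::string::npos;
}
```

With this helper, `in_basic_source(';')` is true and `in_basic_source('@')` is false, which matches the statement that @ is not part of the basic set.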

Let's look at some example binary representations for a few basic source characters. They can be completely arbitrary, and they need not correspond to ASCII values.

A -> 0000000
B -> 0100100
C -> 0011101

The basic execution character set …

§2.2/3: The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits.

As stated the basic execution character set contains all members of basic source character set. It still doesn't include any other character like @. The basic execution character set can have a different binary representation.

In addition, the basic execution character set contains representations for alert, backspace, carriage return and a null character.

A          -> 10110101010
B          -> 00001000101    <- basic source character set
C          -> 10101011111
----------------------------------------------------------
null       -> 00000000000
Backspace  -> 11111100011

If the basic execution character set is 11 bits long (like in this example) the char data type shall be large enough to store 11 bits but it may be longer.
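The size guarantee can be checked at compile time. This is a minimal sketch; the function names are hypothetical, and the only facts relied on are that `sizeof(char)` is 1, that `CHAR_BIT` is at least 8, and that `unsigned char` has no padding bits.

```cpp
#include <climits>
#include <limits>

// sizeof(char) is 1 by definition; its width in bits is CHAR_BIT.
static_assert(sizeof(char) == 1, "char is the unit of sizeof");
static_assert(CHAR_BIT >= 8, "a char has at least 8 bits");

// Number of bits in a char on this implementation.
constexpr int bits_per_char() { return CHAR_BIT; }

// Number of value bits in unsigned char (equal to CHAR_BIT, no padding).
constexpr int value_bits_unsigned_char() {
    return std::numeric_limits<unsigned char>::digits;
}
```

On a hypothetical implementation with an 11-bit basic execution character set, `bits_per_char()` would have to return at least 11.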

… and the basic execution wide character set:

The basic execution wide character set is used for wide characters (wchar_t). It is basically the same as the basic execution character set but can have different binary representations as well.

A          -> 1011010101010110101010
B          -> 0000100010110101011111    <- basic source character set
C          -> 1010100101101000011011
---------------------------------------------------------------------
null       -> 0000000000000000000000
Backspace  -> 1111110001100000000001

The only fixed member is the null character which needs to be a sequence of 0 bits.
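A small sketch of what is and is not fixed: the null characters must be zero and the decimal digits must be contiguous, while the value of a letter such as 'A' is up to the implementation. The two helper functions are hypothetical names used only for illustration.

```cpp
// Guaranteed by the standard: null (wide) character has value zero.
static_assert('\0' == 0, "null character is all zero bits");
static_assert(L'\0' == 0, "null wide character is all zero bits");

// Also guaranteed: the digits 0..9 have consecutive increasing values.
static_assert('9' - '0' == 9, "decimal digits are contiguous");

// Implementation-defined: the value chosen for 'A' in each set.
int narrow_value_of_A() { return 'A'; }
long wide_value_of_A() { return static_cast<long>(L'A'); }
```

On an ASCII-based implementation both functions return 65, but nothing in the quoted paragraphs requires that.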

Converting between basic character sets:

§2.1.1.5: Each source character set member, escape sequence, or universal-character-name in character literals and string literals is converted to a member of the execution character set (2.13.2, 2.13.4).

When a C++ source file is compiled, each basic source character that appears in a character or string literal is converted to the corresponding member of the basic execution (wide) character set.

Example:

const char* string0    =  "BA\bC";
const wchar_t* string1 = L"BA\bC";

Since string0 is a normal (narrow) string, it is converted to the basic execution character set; string1, a wide string, is converted to the basic execution wide character set.

string0 -> 00001000101 10110101010 11111100011 10101011111
string1 -> 0000100010110101011111 1011010101010110101010    // continued
           1111110001100000000001 1010100101101000011011
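To see the values a real compiler picked for these two literals, the code units can be collected and inspected. This is a sketch with a hypothetical helper named `code_units`; the numbers it returns are implementation-defined, so none are shown here.

```cpp
#include <string>
#include <vector>

// Returns the numeric value of every code unit in a narrow or wide string.
template <typename CharT>
std::vector<unsigned long> code_units(const std::basic_string<CharT>& s) {
    std::vector<unsigned long> out;
    for (CharT c : s) {
        out.push_back(static_cast<unsigned long>(c));
    }
    return out;
}
```

Calling `code_units(std::string("BA\bC"))` and `code_units(std::wstring(L"BA\bC"))` each yields four values, one per character, in the execution and execution wide character set respectively.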

Something about file encodings:

There are several kinds of file encodings. For example, ASCII is 7 bits long, and Windows-1252 (often called ANSI) is 8 bits long. ASCII doesn't contain non-English characters. ANSI contains some European characters like ä Ö Õ ø.

Newer file encodings like UTF-8 or UTF-32 can contain characters of any language. In UTF-8, characters are variable in length. In UTF-32, every character is 32 bits long.
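The length difference can be observed with the `u8` and `U` literal prefixes (C++11 and later), which force UTF-8 and UTF-32 regardless of the execution character set. The sketch below writes ä as the universal-character-name `\u00E4` so it does not depend on the source file encoding; the function names are hypothetical.

```cpp
#include <cstddef>

// Code units used for U+00E4 (ä), excluding the terminating null.
constexpr std::size_t utf8_units_a_umlaut() {
    return sizeof(u8"\u00E4") - 1;                    // UTF-8: two bytes
}
constexpr std::size_t utf32_units_a_umlaut() {
    return sizeof(U"\u00E4") / sizeof(char32_t) - 1;  // UTF-32: one unit
}

// A basic character such as 'A' needs a single UTF-8 byte.
constexpr std::size_t utf8_units_A() {
    return sizeof(u8"A") - 1;
}
```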

File encoding requirements:

Most compilers offer a command-line switch to specify the file encoding of the source file.

A C++ source file needs to be encoded in a file encoding which has a representation of the basic source character set. For example: the file encoding of the source file needs to have a representation of the ; character.

If you can't type the character ; within the encoding chosen as the encoding of the source file, that encoding is not suitable as a C++ source file encoding.

Non-basic character sets:

Characters not included in the basic source character set may still belong to the source character set. The source character set is effectively determined by the file encoding.

For example: the @ character is not included in the basic source character set, but it may be included in the source character set. The chosen file encoding of the input source file might contain a representation of @. If it doesn't contain a representation for @, you can't use the character @ within strings.

Characters not included in the basic execution (wide) character set belong to the execution (wide) character set.

Remember that the compiler converts the characters from the source character set to the execution character set and the execution wide character set. Therefore there needs to be a way to convert these characters.

For example: if you specify Windows-1252 as the encoding of the source character set and specify ASCII as the execution character set, there is no way to convert this string:

const char* string0 = "string with European characters ö, Ä, ô, Ð.";

These characters cannot be represented in ASCII.
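The reason is simply that ASCII only defines the values 0 to 127, while Windows-1252 stores ö as the single byte 0xF6. A minimal sketch of that representability check follows; `representable_in_ascii` is a hypothetical helper and not something the compiler exposes.

```cpp
#include <string>

// True if every byte of the input is within the 7-bit ASCII range.
bool representable_in_ascii(const std::string& bytes) {
    for (unsigned char b : bytes) {
        if (b > 0x7F) {
            return false;  // e.g. 0xF6, which is ö in Windows-1252
        }
    }
    return true;
}
```

Here `representable_in_ascii("string with")` is true, while `representable_in_ascii("\xF6")` is false.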

Specifying character sets:

Here are some examples of how to specify the character sets using gcc. The values shown are the usual defaults (for the wide set, gcc picks UTF-32 or UTF-16, whichever matches the width of wchar_t).

-finput-charset=UTF-8         <- source character set
-fexec-charset=UTF-8          <- execution character set
-fwide-exec-charset=UTF-32    <- execution wide character set

With UTF-8 and UTF-32 as the default encodings, C++ source files can contain strings with characters of any language. UTF-8 characters can be converted both ways without problems.
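One way to observe the effect of these switches is to look at how many code units the compiler emits for a non-basic character in an ordinary literal. This is only a sketch with hypothetical function names. Under gcc's default UTF-8 execution character set I would expect `narrow_units()` to be 2, and 1 when compiling with `-fexec-charset=ISO-8859-1`, but the result depends on the compiler and its options.

```cpp
#include <climits>
#include <cstddef>

// Code units (excluding the terminating null) emitted for U+00F6 (ö)
// in the execution character set chosen at compile time.
constexpr std::size_t narrow_units() {
    return sizeof("\u00F6") - 1;
}

// The same character in the execution wide character set.
constexpr std::size_t wide_units() {
    return sizeof(L"\u00F6") / sizeof(wchar_t) - 1;
}

// Width of wchar_t in bits, which drives gcc's default wide charset.
constexpr std::size_t wide_unit_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}
```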

The extended character set:

§1.1.3: multibyte character, a sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment. The extended character set is a superset of the basic character set (2.2).

A multibyte character can be longer than a single normal character (char). Depending on the encoding, a lead byte or a shift (escape) sequence marks it as a multibyte character.

Multibyte characters are processed according to the locale set in the user's runtime environment. These multibyte characters are converted at runtime to the encoding set in the user's environment.
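The locale dependence can be sketched with the standard `std::mbrtowc` function, which decodes one multibyte character at a time according to the current C locale. The helper name `count_multibyte_chars` is hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Counts the characters in a multibyte string under the current C locale.
// Returns -1 if the bytes are not a valid sequence in that locale.
long count_multibyte_chars(const char* s) {
    std::mbstate_t state = std::mbstate_t();
    std::size_t remaining = std::strlen(s);
    long count = 0;
    while (remaining > 0) {
        wchar_t wc;
        std::size_t n = std::mbrtowc(&wc, s, remaining, &state);
        if (n == static_cast<std::size_t>(-1) ||
            n == static_cast<std::size_t>(-2)) {
            return -1;  // invalid or incomplete multibyte character
        }
        if (n == 0) {
            break;      // decoded a null character
        }
        s += n;
        remaining -= n;
        ++count;
    }
    return count;
}
```

In the default "C" locale, a string made of basic characters such as "abc" gives 3. To have longer sequences interpreted according to the user's environment, a program would typically call `std::setlocale(LC_ALL, "")` from `<clocale>` first.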
