ASCII - code point vs. character encoding


Question

I found an interesting article, "A tutorial on character code issues" (http://jkorpela.fi/chars.html#code), which explains the terms "character code"/"code point" and "character encoding".

The former is just an integer assigned to a character, for example 65 for the character A. The character encoding defines how such a code point is represented by one or more bytes.

For good old ASCII the author says: "The character encoding specified by the ASCII standard is very simple, and the most obvious one for any character code where the code numbers do not exceed 255: each code number is presented as an octet with the same value."

So 65, which is the code point for A, would be encoded as 0100 0001.

Because ASCII has 128 characters, there are 128 code points (0 to 127), and each code point is always encoded as one byte.

To summarize, I have the following steps to encode characters in ASCII:

  1. Assign a number (code point) to each character (e.g. A -> 65)
  2. Encode the character as a byte with the same value (e.g. 0100 0001)

So for the letters A and B it would be

A -> 65 -> 0100 0001
B -> 66 -> 0100 0010
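
The two steps can be checked with a minimal Python sketch. The helper name `ascii_steps` is made up for this illustration; `ord`, `str.encode` and `format` are standard Python built-ins. It shows that for ASCII the byte value is simply the code point, so 65 comes out as the bit pattern 0100 0001.

```python
def ascii_steps(ch):
    """Return (code point, encoded bytes, bit string) for one ASCII character."""
    code_point = ord(ch)              # step 1: character -> integer
    encoded = ch.encode("ascii")      # step 2: integer -> one byte
    bits = format(encoded[0], "08b")  # the byte written as 8 binary digits
    return code_point, encoded, bits


for ch in "AB":
    code_point, encoded, bits = ascii_steps(ch)
    print(ch, "->", code_point, "->", bits[:4], bits[4:])
# prints:
# A -> 65 -> 0100 0001
# B -> 66 -> 0100 0010
```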

My question is:

Why this separation of code points and encoding in ASCII? ASCII has only one encoding, so at least for ASCII it is not clear to me why the intermediate step (mapping to an integer) is done. A direct encoding like

A -> 0100 0001
B -> 0100 0010

would also be possible, or not? If there were multiple encodings for an ASCII character, the separation would be reasonable, but with only one encoding form it doesn't make sense to me.

Recommended answer

You're right. Each concept doesn't necessarily require a discernible implementation for a particular encoding. But when discussing character sets and encodings in general, it's good to have all the concepts distinguished.

Actually, you could consider ASCII to have two encodings, one 7-bit and one 8-bit. The 7-bit encoding was used along with a scheme that puts a parity bit in the 8th bit of a byte. Unicode is notable for having many encodings, including UTF-8, UTF-16 and UTF-32.
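
As a rough illustration of one code point having several byte representations, the sketch below uses Python's built-in codecs. The helper `with_even_parity` is hypothetical, and the choice of even parity is an assumption made for the example; the text only says that a parity bit occupied the 8th bit.

```python
def with_even_parity(code_point):
    """Put a 7-bit code in bits 0-6 and an even-parity bit in bit 7.

    Even parity is an assumption for this sketch; odd parity was also used.
    """
    if not 0 <= code_point < 128:
        raise ValueError("not a 7-bit code")
    parity = bin(code_point).count("1") % 2  # 1 if the count of 1-bits is odd
    return code_point | (parity << 7)


# The same code point (65, "A") under several encodings.
encoded = {name: "A".encode(name)
           for name in ("ascii", "utf-8", "utf-16-be", "utf-32-be")}
for name, data in encoded.items():
    print(name, data.hex())

print(format(with_even_parity(ord("A")), "08b"))  # "A" has two 1-bits
print(format(with_even_parity(ord("C")), "08b"))  # "C" has three 1-bits
```

For A the ASCII and UTF-8 bytes are identical, while UTF-16 and UTF-32 use two and four bytes for the same code point.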

There is a missing term: code unit. An encoding maps a code point to a sequence of code units. Code units are integers of a fixed size. As you may know, integers larger than 8 bits have a byte ordering (also known as endianness). This leads to UTF-16 and UTF-32 having big-endian and little-endian variants.
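
A small Python sketch of code units and byte order follows. The helper `utf16_code_units` is invented for this example; `struct.unpack` and the `utf-16-be`/`utf-16-le` codecs are part of the standard library. The code point U+1F600 is chosen only because it needs two 16-bit code units in UTF-16.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of 16-bit integers."""
    data = text.encode("utf-16-be")
    # Each code unit is 2 bytes; ">H" reads a big-endian unsigned 16-bit value.
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


# One code unit for "A", two (a surrogate pair) for U+1F600.
units = utf16_code_units("A\U0001F600")
print([hex(u) for u in units])

# The same single code unit 0x0041 serialized with either byte order.
big = "A".encode("utf-16-be")
little = "A".encode("utf-16-le")
print(big.hex(), little.hex())
```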

Fundamental rule for computerized text: read with the encoding that the file or stream was written with. Bytes that represent text must be accompanied by knowledge of the encoding, which comes from a declaration, standard, convention, specification, ….
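
The rule can be demonstrated with a short Python sketch. The sample string and the choice of Latin-1 as the "wrong" encoding are assumptions made for the illustration.

```python
text = "na\u00efve"            # "naïve", contains the non-ASCII code point U+00EF
data = text.encode("utf-8")    # written as UTF-8

right = data.decode("utf-8")    # read with the encoding it was written with
wrong = data.decode("latin-1")  # read with a different encoding: mojibake

print(right)
print(wrong)

# A strict ASCII reader rejects the bytes outright instead of guessing.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)
```

Decoding with Latin-1 does not fail, it silently produces different characters, which is why the encoding has to be known rather than inferred from the bytes.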

There are 128 code points in ASCII. Most of the time when ASCII is mentioned, the term is not used correctly. Ask for the specification that says ASCII, or for a correction.
