What's the difference between an "encoding," a "character set," and a "code page"?


Question

I'm really trying to get better with this stuff. I'm pretty functional with internationalization concepts like this, but I need to get a better background on the theory behind it.

I've read Spolsky's article, but I'm still unclear because these three terms get used interchangeably a LOT -- even in that article. I think at least two of them are talking about the same thing.

I suspect a high percentage of developers flub their way through this stuff on a daily basis. I don't want to be one of those developers anymore.

Answer

A ‘character set’ is just what it says: a properly-specified list of distinct characters.

An ‘encoding’ is a mapping between a character set (typically Unicode today) and a (usually byte-based) technical representation of the characters.

UTF-8 is an encoding, but not a character set. It is an encoding of the Unicode character set(*).
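
To make the distinction concrete, here is a minimal sketch in Python (my choice of language for illustration; the variable names are made up). The character é is one entry in the Unicode character set, code point U+00E9, and each encoding maps that same entry to different bytes.

```python
# One character from the Unicode character set (written as an escape
# so the example does not depend on the file's own encoding).
ch = "\u00e9"                 # é, LATIN SMALL LETTER E WITH ACUTE
code_point = ord(ch)          # its number in the character set

# Three encodings turn that same code point into different bytes.
utf8_bytes = ch.encode("utf-8")          # two bytes
utf16_bytes = ch.encode("utf-16-le")     # two bytes, different values
latin1_bytes = ch.encode("iso-8859-1")   # one byte

print(hex(code_point), utf8_bytes, utf16_bytes, latin1_bytes)
```

The code point stays the same throughout; only the byte representation changes with the encoding.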

The confusion comes about because most other well-known encodings (e.g. ISO-8859-1) started out as separate character sets. Then, when Unicode came along as a superset of most of these character sets, it became possible to think of them as different (but partial) encodings of the same (Unicode) character set, rather than just isolated character sets. Looking at them this way allows you to convert between them through Unicode easily, which would not be possible if they were merely isolated character sets. But it still makes sense to refer to them as character sets, so either term could be used.
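
As a rough illustration of converting "through Unicode", the Python sketch below decodes ISO-8859-1 bytes into Unicode text and re-encodes them as UTF-8. The sample bytes and variable names are assumptions for the example. It also shows what "partial" means: the euro sign exists in Unicode but has no byte in ISO-8859-1.

```python
# Bytes as they might arrive from an ISO-8859-1 source ("café").
latin1_data = b"caf\xe9"

# Step 1: decode the bytes into Unicode text.
text = latin1_data.decode("iso-8859-1")

# Step 2: encode that text into the target encoding.
utf8_data = text.encode("utf-8")

# ISO-8859-1 covers only part of Unicode: the euro sign (U+20AC)
# cannot be represented, so encoding it fails.
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(text, utf8_data, euro_fits)
```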

A ‘code page’ is a term stemming from IBM, where it chose which set of symbols would be displayed. The term continued to be used by DOS and then Windows, through to Unicode-aware Windows where it just acts as an encoding with a numbered identifier. Whilst a numbered ‘code page’ is an idea not inherently limited to Microsoft, today the term would almost always just mean an encoding that Windows knows about.

When one is talking of code page ‹some number› one is typically talking about a Windows-specific encoding, as distinct from an encoding devised by a standards body. For example code page 28591 would not normally be referred to under that name, but simply ‘ISO-8859-1’. The Windows-specific Western European encoding based on ISO-8859-1 (with a few extra characters replacing some of its control codes) would normally be referred to as ‘code page 1252’.
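
The difference between ISO-8859-1 and code page 1252 can be seen with a single byte. This is a small sketch using Python's codec names `iso-8859-1` and `cp1252`; the variable names are illustrative only.

```python
byte = b"\x80"

# In code page 1252 this byte is a printable character, the euro sign.
as_cp1252 = byte.decode("cp1252")

# In ISO-8859-1 the same byte is a C1 control code, U+0080.
as_latin1 = byte.decode("iso-8859-1")

# From 0xA0 upwards the two encodings agree byte for byte.
high_range = bytes(range(0xA0, 0x100))
same_high = high_range.decode("cp1252") == high_range.decode("iso-8859-1")

print(repr(as_cp1252), repr(as_latin1), same_high)
```

The two encodings differ only in the 0x80 to 0x9F range, which is why text mislabelled as one or the other usually looks right until a curly quote or euro sign turns up.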

[*: All the UTFs are encodings not character sets, but this kind of thing isn't exclusive to Unicode. For example the Japanese standard JIS X 0208 defines a character set and two different byte encodings for it: the somewhat unpleasant high-byte-based encoding (‘Shift-JIS’), and the deeply horrific escape-switching-based encoding (‘JIS’).]
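
The footnote's point can be sketched in Python as well, on the assumption that the ‘JIS’ encoding mentioned above corresponds to what Python calls `iso2022_jp` (ISO-2022-JP). The same character from the JIS X 0208 set gets a high-byte form in Shift-JIS and an escape-wrapped form in the other.

```python
ch = "\u3042"  # あ, HIRAGANA LETTER A, a member of JIS X 0208

# Shift-JIS: two bytes, the first one with the high bit set.
sjis = ch.encode("shift_jis")

# ISO-2022-JP (assumed to be the 'JIS' of the footnote): an escape
# sequence switches into JIS X 0208 before the two character bytes.
jis = ch.encode("iso2022_jp")

print(sjis, jis)
```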
