What is the exact difference between Windows-1252(1/3/4) and ISO-8859-1?


Question



We are hosting PHP apps on a Debian-based LAMP installation. Everything is quite OK - performance-, administration- and management-wise. However, being somewhat new developers (we're still in high school), we've run into some problems with character encoding for Western charsets.

After doing a lot of research, I have come to the conclusion that the information online is somewhat confusing. It talks about Windows-1252 being ANSI and totally ISO-8859-1 compatible.

So anyway, what is the difference between Windows-1252(1/3/4) and ISO-8859-1? And where does ANSI come into this, anyway?

What encoding should we use on our Debian servers (and workstations) to ensure that clients get all information in the intended way and that we don't lose any characters along the way?

Solution

I'd like to answer this in a more web-like manner, and to answer it we need a little history. Joel Spolsky has written a very good introductory article on the absolute minimum every developer should know about Unicode and character encodings. Bear with me here, because this is going to be somewhat of a looong answer. :)

For the history, I'll quote some passages from there. (Thank you very much, Joel! :) )

The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes.

And all was good, assuming you were an English speaker. Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.

So now "OEM character sets" were distributed with PCs, and these were still all different and incompatible. And to our contemporary amazement - it was all fine! They didn't have the Internet back then, and people rarely exchanged files between systems with different locales.

Joel goes on saying:

In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.

And this is how the "Windows code pages" were eventually born. They were actually "parented" by the DOS code pages. And then Unicode was born! :) UTF-8 is "another system for storing your string of Unicode code points", where "every code point from 0-127 is stored in a single byte" - that is, it is ASCII-compatible. I will not go into any more specifics of Unicode and UTF-8, but you should read up on the BOM, endianness, and character encodings in general.
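The ASCII compatibility and the BOM/endianness points above are easy to see in practice. A small sketch (in Python, purely for illustration; any language with explicit codecs shows the same thing):

```python
# UTF-8 is ASCII-compatible: code points 0-127 occupy exactly one byte each.
assert "Hello".encode("utf-8") == "Hello".encode("ascii")

# Beyond ASCII, UTF-8 switches to multi-byte sequences.
assert "é".encode("utf-8") == b"\xc3\xa9"        # U+00E9, 2 bytes
assert "€".encode("utf-8") == b"\xe2\x82\xac"    # U+20AC, 3 bytes

# UTF-16, by contrast, is endianness-sensitive, which is why the BOM
# (byte order mark) exists: it tells the reader which byte order follows.
assert "A".encode("utf-16-le") == b"A\x00"       # little-endian
assert "A".encode("utf-16-be") == b"\x00A"       # big-endian
assert "A".encode("utf-16").startswith((b"\xff\xfe", b"\xfe\xff"))  # BOM first
```

Note that UTF-8 needs no BOM at all, which is one of the reasons it won on the web.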

On "the ANSI conspiracy", Microsoft actually admits the mislabeling of Windows-1252 in a glossary of terms:

The so-called Windows character set (WinLatin1, or Windows code page 1252, to be exact) uses some of those positions for printable characters. Thus, the Windows character set is NOT identical with ISO 8859-1. The Windows character set is often called "ANSI character set", but this is SERIOUSLY MISLEADING. It has NOT been approved by ANSI.

So "ANSI", when referring to Windows character sets, is not actually ANSI-certified! :)

As Jukka pointed out (credits go to him for the nice answer):

Windows-1252 is identical to ISO Latin 1, also known as ISO-8859-1, as a character encoding, except that the code range 0x80 to 0x9F is reserved for control characters in ISO-8859-1 (the so-called C1 Controls), whereas in Windows-1252 some of the codes there are assigned to printable characters (mostly punctuation characters) and others are left undefined.
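Jukka's point is easy to verify: the two encodings agree everywhere except in the 0x80-0x9F range. A quick Python check (cp1252 is Python's name for Windows-1252, latin-1 for ISO-8859-1):

```python
# 0x93/0x94 are curly quotes in Windows-1252 but C1 controls in ISO-8859-1.
raw = b"\x93quoted\x94"
assert raw.decode("cp1252") == "\u201cquoted\u201d"   # "quoted" with smart quotes
assert raw.decode("latin-1") == "\x93quoted\x94"      # C1 control characters

# Outside 0x80-0x9F the two encodings agree byte for byte.
for b in list(range(0x20, 0x7F)) + list(range(0xA0, 0x100)):
    assert bytes([b]).decode("cp1252") == bytes([b]).decode("latin-1")

# A few positions (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined in Windows-1252,
# so strict decoding fails there.
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError:
    pass  # expected: 0x81 has no assigned character
```

This is exactly why curly quotes pasted from Word turn into garbage when a document labeled ISO-8859-1 is actually Windows-1252, or vice versa.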

However, my personal opinion and technical understanding is that neither Windows-1252 nor ISO-8859-1 IS A WEB ENCODING! :) So:

  • For web pages, please use UTF-8 as the encoding for the content. So store data as UTF-8 and "spit it out" with the HTTP header Content-Type: text/html; charset=utf-8.

    There is also a thing called the HTML content-type meta tag: <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">. When browsers encounter this tag, they actually start again from the beginning of the HTML document so that they can reinterpret it in the declared encoding. This should happen only if there is no 'Content-Type' header.

  • Use other specific encodings if the users of your system need files generated from it. For example, some Western users may need Excel-generated files, or CSVs in Windows-1252. If this is the case, encode the text in that locale, store it on the fs, and serve it as a downloadable file.

  • There is another thing to be aware of in the design of HTTP: The content-encoding distribution mechanism should work like this.

    I. The client requests a web page in a specific content type and encoding via the 'Accept' and 'Accept-Charset' request headers.

    II. Then the server (or web application) returns the content transcoded to that encoding and character set.

This is NOT THE CASE in most modern web apps. What actually happens is that web applications serve content as UTF-8 (forcing it on the client). And this works because browsers interpret received documents based on the response headers, not on what they actually requested.
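The "encode at the boundary" idea from the bullets above can be sketched briefly: keep everything as Unicode internally, and encode only when producing the response body - UTF-8 for pages, Windows-1252 only for legacy downloads. The helper below is a hypothetical illustration, not any framework's API:

```python
import csv
import io

def make_csv_download(rows, encoding="cp1252"):
    """Render rows as CSV text, then encode for a legacy Windows consumer.

    Hypothetical helper: in a real app the headers/body would go through
    your web framework's response object.
    """
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    # errors="replace" substitutes '?' for characters the target charset lacks,
    # so the download never crashes on, say, CJK input.
    body = buf.getvalue().encode(encoding, errors="replace")
    headers = {"Content-Type": f"text/csv; charset={encoding}"}
    return headers, body

headers, body = make_csv_download([["café", "1.50€"]])
print(headers["Content-Type"])  # text/csv; charset=cp1252
print(body)                     # 'é' -> 0xE9, '€' -> 0x80 in Windows-1252
```

The same pattern in reverse (decode at input, as early as possible) is what keeps the application core encoding-agnostic.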

We should all go Unicode, so please, please, please use UTF-8 to distribute your content wherever possible and applicable. Or else the elders of the Internet will haunt you! :)

P.S. Some more nice articles on using MS Windows characters in Web Pages can be found here and here.

